New York City Yellow Taxi Data¶
Objective¶
In this case study you will be learning exploratory data analysis (EDA) with the help of a dataset on yellow taxi rides in New York City. This will enable you to understand why EDA is an important step in the process of data science and machine learning.
Problem Statement¶
As an analyst at an upcoming taxi operation in NYC, you are tasked to use the 2023 taxi trip data to uncover insights that could help optimise taxi operations. The goal is to analyse patterns in the data that can inform strategic decisions to improve service efficiency, maximise revenue, and enhance passenger experience.
Tasks¶
You need to perform the following steps for successfully completing this assignment:
- Data Loading
- Data Cleaning
- Exploratory Analysis: Bivariate and Multivariate
- Creating Visualisations to Support the Analysis
- Deriving Insights and Stating Conclusions
NOTE: The marks given along with headings and sub-headings are cumulative marks for those particular headings/sub-headings.
The actual marks for each task are specified within the tasks themselves.
For example, marks given with heading 2 or sub-heading 2.1 are the cumulative marks, for your reference only.
The marks you will receive for completing tasks are given with the tasks.
Suppose the marks for two tasks are: 3 marks for 2.1.1 and 2 marks for 3.2.2, or then, you will earn 3 marks for completing task 2.1.1 and 2 marks for completing task 3.2.2.
- 2.1.1 [3 marks]
- 3.2.2 [2 marks]
Data Understanding¶
The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
The data is stored in Parquet format (.parquet). The dataset is from 2009 to 2024. However, for this assignment, we will only be using the data from 2023.
The data for each month is present in a different parquet file. You will get twelve files for each of the months in 2023.
The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers like vendors and taxi hailing apps.
You can find the link to the TLC trip records page here: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Description¶
You can find the data description here: Data Dictionary
Trip Records
| Field Name | description |
|---|---|
| VendorID | A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc. |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| PULocationID | TLC Taxi Zone in which the taximeter was engaged |
| DOLocationID | TLC Taxi Zone in which the taximeter was disengaged |
| RateCodeID | The final rate code in effect at the end of the trip. 1 = Standard rate 2 = JFK 3 = Newark 4 = Nassau or Westchester 5 = Negotiated fare 6 = Group ride |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip |
| Payment_type | A numeric code signifying how the passenger paid for the trip. 1 = Credit card 2 = Cash 3 = No charge 4 = Dispute 5 = Unknown 6 = Voided trip |
| Fare_amount | The time-and-distance fare calculated by the meter. Extra Miscellaneous extras and surcharges. Currently, this only includes the 0.50 and 1 USD rush hour and overnight charges. |
| MTA_tax | 0.50 USD MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | 0.30 USD improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Total amount of all tolls paid in trip. |
| total_amount | The total amount charged to passengers. Does not include cash tips. |
| Congestion_Surcharge | Total amount collected in trip for NYS congestion surcharge. |
| Airport_fee | 1.25 USD for pick up only at LaGuardia and John F. Kennedy Airports |
Although the amounts of extra charges and taxes applied are specified in the data dictionary, you will see that some cases have different values of these charges in the actual data.
Taxi Zones
Each of the trip records contains a field corresponding to the location of the pickup or drop-off of the trip, populated by numbers ranging from 1-263.
These numbers correspond to taxi zones, which may be downloaded as a table or map/shapefile and matched to the trip records using a join.
This is covered in more detail in later sections.
1 Data Preparation¶
[5 marks]
Import Libraries¶
Requirement already satisfied: numpy in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (1.26.4)
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
Requirement already satisfied: pandas in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (2.2.2) Requirement already satisfied: numpy>=1.23.2 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas) (2024.2) Requirement already satisfied: tzdata>=2022.7 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas) (2024.2) Requirement already satisfied: six>=1.5 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
Requirement already satisfied: matplotlib in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (3.10.0) Requirement already satisfied: contourpy>=1.0.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (1.3.1) Requirement already satisfied: cycler>=0.10 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (4.55.3) Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (1.4.8) Requirement already satisfied: numpy>=1.23 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (1.26.4) Requirement already satisfied: packaging>=20.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (23.1) Requirement already satisfied: pillow>=8 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (11.1.0) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (3.2.1) Requirement already satisfied: python-dateutil>=2.7 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib) (2.8.2) Requirement already satisfied: six>=1.5 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
Requirement already satisfied: seaborn in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (0.13.2) Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from seaborn) (1.26.4) Requirement already satisfied: pandas>=1.2 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from seaborn) (2.2.2) Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from seaborn) (3.10.0) Requirement already satisfied: contourpy>=1.0.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.1) Requirement already satisfied: cycler>=0.10 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.55.3) Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.8) Requirement already satisfied: packaging>=20.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (23.1) Requirement already satisfied: pillow>=8 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.1.0) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.1) Requirement already satisfied: python-dateutil>=2.7 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.2->seaborn) (2024.2) Requirement already satisfied: tzdata>=2022.7 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.2->seaborn) (2024.2) Requirement already satisfied: six>=1.5 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
1.1 Load the dataset¶
[5 marks]
You will see twelve files, one for each month.
To read parquet files with Pandas, you have to follow a similar syntax as that for CSV files.
df = pd.read_parquet('file.parquet')
Collecting pyarrow Downloading pyarrow-19.0.0-cp311-cp311-win_amd64.whl.metadata (3.4 kB) Downloading pyarrow-19.0.0-cp311-cp311-win_amd64.whl (25.3 MB) ---------------------------------------- 0.0/25.3 MB ? eta -:--:-- ---------------------------------------- 0.0/25.3 MB ? eta -:--:-- ---------------------------------------- 0.0/25.3 MB 653.6 kB/s eta 0:00:39 ---------------------------------------- 0.3/25.3 MB 3.2 MB/s eta 0:00:08 - -------------------------------------- 0.7/25.3 MB 4.8 MB/s eta 0:00:06 - -------------------------------------- 1.1/25.3 MB 5.6 MB/s eta 0:00:05 -- ------------------------------------- 1.5/25.3 MB 6.3 MB/s eta 0:00:04 -- ------------------------------------- 1.9/25.3 MB 6.6 MB/s eta 0:00:04 --- ------------------------------------ 2.4/25.3 MB 7.1 MB/s eta 0:00:04 ---- ----------------------------------- 2.7/25.3 MB 7.2 MB/s eta 0:00:04 ---- ----------------------------------- 3.1/25.3 MB 7.4 MB/s eta 0:00:04 ----- ---------------------------------- 3.6/25.3 MB 7.7 MB/s eta 0:00:03 ------ --------------------------------- 4.1/25.3 MB 7.9 MB/s eta 0:00:03 ------ --------------------------------- 4.4/25.3 MB 7.8 MB/s eta 0:00:03 ------- -------------------------------- 4.9/25.3 MB 8.0 MB/s eta 0:00:03 -------- ------------------------------- 5.3/25.3 MB 8.1 MB/s eta 0:00:03 --------- ------------------------------ 5.7/25.3 MB 8.1 MB/s eta 0:00:03 --------- ------------------------------ 6.2/25.3 MB 8.2 MB/s eta 0:00:03 ---------- ----------------------------- 6.7/25.3 MB 8.3 MB/s eta 0:00:03 ----------- ---------------------------- 7.1/25.3 MB 8.4 MB/s eta 0:00:03 ------------ --------------------------- 7.6/25.3 MB 8.5 MB/s eta 0:00:03 ------------ --------------------------- 8.2/25.3 MB 8.7 MB/s eta 0:00:02 ------------- -------------------------- 8.7/25.3 MB 8.8 MB/s eta 0:00:02 -------------- ------------------------- 9.2/25.3 MB 8.9 MB/s eta 0:00:02 --------------- ------------------------ 9.8/25.3 MB 9.0 MB/s eta 0:00:02 ---------------- ----------------------- 10.3/25.3 MB 9.6 MB/s eta 0:00:02 ----------------- ---------------------- 10.9/25.3 MB 9.9 MB/s eta 0:00:02 ------------------ --------------------- 11.4/25.3 MB 10.1 MB/s eta 0:00:02 ------------------ --------------------- 12.0/25.3 MB 10.2 MB/s eta 0:00:02 ------------------- -------------------- 12.5/25.3 MB 10.4 MB/s eta 0:00:02 -------------------- ------------------- 13.0/25.3 MB 10.6 MB/s eta 0:00:02 --------------------- ------------------ 13.5/25.3 MB 10.6 MB/s eta 0:00:02 ---------------------- ----------------- 14.1/25.3 MB 10.7 MB/s eta 0:00:02 ---------------------- ----------------- 14.5/25.3 MB 10.9 MB/s eta 0:00:01 ----------------------- ---------------- 15.2/25.3 MB 11.1 MB/s eta 0:00:01 ------------------------ --------------- 15.7/25.3 MB 11.3 MB/s eta 0:00:01 ------------------------- -------------- 16.2/25.3 MB 11.3 MB/s eta 0:00:01 -------------------------- ------------- 16.8/25.3 MB 11.3 MB/s eta 0:00:01 --------------------------- ------------ 17.3/25.3 MB 11.5 MB/s eta 0:00:01 ---------------------------- ----------- 17.9/25.3 MB 11.5 MB/s eta 0:00:01 ----------------------------- ---------- 18.4/25.3 MB 11.5 MB/s eta 0:00:01 ----------------------------- ---------- 18.9/25.3 MB 11.5 MB/s eta 0:00:01 ------------------------------ --------- 19.4/25.3 MB 11.5 MB/s eta 0:00:01 ------------------------------- -------- 19.9/25.3 MB 11.3 MB/s eta 0:00:01 -------------------------------- ------- 20.4/25.3 MB 11.3 MB/s eta 0:00:01 --------------------------------- ------ 20.9/25.3 MB 11.3 MB/s eta 0:00:01 ---------------------------------- ----- 21.5/25.3 MB 11.3 MB/s eta 0:00:01 ---------------------------------- ----- 22.0/25.3 MB 11.5 MB/s eta 0:00:01 ----------------------------------- ---- 22.4/25.3 MB 11.3 MB/s eta 0:00:01 ----------------------------------- ---- 22.7/25.3 MB 11.1 MB/s eta 0:00:01 ------------------------------------ --- 23.2/25.3 MB 11.1 MB/s eta 0:00:01 ------------------------------------- -- 23.8/25.3 MB 11.1 MB/s eta 0:00:01 -------------------------------------- - 24.3/25.3 MB 11.1 MB/s eta 0:00:01 --------------------------------------- 24.9/25.3 MB 11.1 MB/s eta 0:00:01 --------------------------------------- 25.3/25.3 MB 10.9 MB/s eta 0:00:01 --------------------------------------- 25.3/25.3 MB 10.9 MB/s eta 0:00:01 ---------------------------------------- 25.3/25.3 MB 10.2 MB/s eta 0:00:00 Installing collected packages: pyarrow Successfully installed pyarrow-19.0.0
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
Collecting fastparquet Downloading fastparquet-2024.11.0-cp311-cp311-win_amd64.whl.metadata (4.3 kB) Requirement already satisfied: pandas>=1.5.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from fastparquet) (2.2.2) Requirement already satisfied: numpy in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from fastparquet) (1.26.4) Collecting cramjam>=2.3 (from fastparquet) Downloading cramjam-2.9.1-cp311-cp311-win_amd64.whl.metadata (5.0 kB) Collecting fsspec (from fastparquet) Downloading fsspec-2025.2.0-py3-none-any.whl.metadata (11 kB) Requirement already satisfied: packaging in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from fastparquet) (23.1) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.5.0->fastparquet) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.5.0->fastparquet) (2024.2) Requirement already satisfied: tzdata>=2022.7 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.5.0->fastparquet) (2024.2) Requirement already satisfied: six>=1.5 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from python-dateutil>=2.8.2->pandas>=1.5.0->fastparquet) (1.16.0) Downloading fastparquet-2024.11.0-cp311-cp311-win_amd64.whl (671 kB) ---------------------------------------- 0.0/671.0 kB ? eta -:--:-- --- ------------------------------------ 51.2/671.0 kB 1.3 MB/s eta 0:00:01 ------------------------ --------------- 409.6/671.0 kB 5.1 MB/s eta 0:00:01 ---------------------------------------- 671.0/671.0 kB 5.3 MB/s eta 0:00:00 Downloading cramjam-2.9.1-cp311-cp311-win_amd64.whl (2.1 MB) ---------------------------------------- 0.0/2.1 MB ? eta -:--:-- ----- ---------------------------------- 0.3/2.1 MB 9.2 MB/s eta 0:00:01 ------------- -------------------------- 0.7/2.1 MB 8.9 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------- -------------------- 1.0/2.1 MB 9.4 MB/s eta 0:00:01 ------------------------ --------------- 1.3/2.1 MB 1.3 MB/s eta 0:00:01 --------------------------------- ------ 1.8/2.1 MB 1.7 MB/s eta 0:00:01 ---------------------------------------- 2.1/2.1 MB 1.9 MB/s eta 0:00:00 Downloading fsspec-2025.2.0-py3-none-any.whl (184 kB) ---------------------------------------- 0.0/184.5 kB ? eta -:--:-- --------------------------------------- 184.5/184.5 kB 11.6 MB/s eta 0:00:00 Installing collected packages: fsspec, cramjam, fastparquet Successfully installed cramjam-2.9.1 fastparquet-2024.11.0 fsspec-2025.2.0
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
<class 'pandas.core.frame.DataFrame'> Index: 3041714 entries, 0 to 3066765 Data columns (total 19 columns): # Column Dtype --- ------ ----- 0 VendorID int64 1 tpep_pickup_datetime datetime64[us] 2 tpep_dropoff_datetime datetime64[us] 3 passenger_count float64 4 trip_distance float64 5 RatecodeID float64 6 store_and_fwd_flag object 7 PULocationID int64 8 DOLocationID int64 9 payment_type int64 10 fare_amount float64 11 extra float64 12 mta_tax float64 13 tip_amount float64 14 tolls_amount float64 15 improvement_surcharge float64 16 total_amount float64 17 congestion_surcharge float64 18 airport_fee float64 dtypes: datetime64[us](2), float64(12), int64(4), object(1) memory usage: 464.1+ MB
(3041714, 19)
How many rows are there? Do you think handling such a large number of rows is computationally feasible when we have to combine the data for all twelve months into one?
To handle this, we need to sample a fraction of data from each of the files. How to go about that? Think of a way to select only some portion of the data from each month's file that accurately represents the trends.
Sampling the Data¶
One way is to take a small percentage of entries for pickup in every hour of a date. So, for all the days in a month, we can iterate through the hours and select 5% values randomly from those. Use
tpep_pickup_datetimefor this. Separate date and hour from the datetime values and then for each date, select some fraction of trips for each of the 24 hours.
To sample data, you can use the sample() method. Follow this syntax:
# sampled_data is an empty DF to keep appending sampled data of each hour
# hour_data is the DF of entries for an hour 'X' on a date 'Y'
sample = hour_data.sample(frac = 0.05, random_state = 42)
# sample 0.05 of the hour_data
# random_state is just a seed for sampling, you can define it yourself
sampled_data = pd.concat([sampled_data, sample]) # adding data for this hour to the DF
This sampled_data will contain 5% values selected at random from each hour.
Note that the code given above is only the part that will be used for sampling and not the complete code required for sampling and combining the data files.
Keep in mind that you sample by date AND hour, not just hour. (Why?)
1.1.1 [5 marks]
Figure out how to sample and combine the files.
Note: It is not mandatory to use the method specified above. While sampling, you only need to make sure that your sampled data represents the overall data of all the months accurately.
Processed 2023-1.parquet, Current total records: 21288 Processed 2023-10.parquet, Current total records: 45696 Processed 2023-11.parquet, Current total records: 68811 Processed 2023-12.parquet, Current total records: 92163 Processed 2023-2.parquet, Current total records: 115780 Processed 2023-3.parquet, Current total records: 138719 Processed 2023-4.parquet, Current total records: 158265 Processed 2023-5.parquet, Current total records: 178473 Processed 2023-6.parquet, Current total records: 201283 Processed 2023-7.parquet, Current total records: 225646 Processed 2023-8.parquet, Current total records: 245763 Processed 2023-9.parquet, Current total records: 265487 Final number of records: 265487
After combining the data files into one DataFrame, convert the new DataFrame to a CSV or parquet file and store it to use directly.
Ideally, you can try keeping the total entries to around 250,000 to 300,000.
'E:\\MS-IIITB-LJMU\\EDA - Assignment'
Data saved as: 1. E:\MS-IIITB-LJMU\EDA - Assignment\sampled_data.parquet 2. E:\MS-IIITB-LJMU\EDA - Assignment\sampled_data.csv
2 Data Cleaning¶
[30 marks]
Now we can load the new data directly.
(265487, 22)
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | date | hour | Airport_fee | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:07:18 | 2023-01-01 00:23:15 | 1.0 | 7.74 | 1.0 | N | 138 | 256 | 2 | 32.40 | 6.0 | 0.5 | 0.00 | 0.0 | 1.0 | 41.15 | 0.0 | 1.25 | 2023-01-01 | 0 | NaN |
| 1 | 2 | 2023-01-01 00:16:41 | 2023-01-01 00:21:46 | 2.0 | 1.24 | 1.0 | N | 161 | 237 | 1 | 7.90 | 1.0 | 0.5 | 2.58 | 0.0 | 1.0 | 15.48 | 2.5 | 0.00 | 2023-01-01 | 0 | NaN |
| 2 | 2 | 2023-01-01 00:14:03 | 2023-01-01 00:24:36 | 3.0 | 1.44 | 1.0 | N | 237 | 141 | 2 | 11.40 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 16.40 | 2.5 | 0.00 | 2023-01-01 | 0 | NaN |
| 3 | 2 | 2023-01-01 00:24:30 | 2023-01-01 00:29:55 | 1.0 | 0.54 | 1.0 | N | 143 | 142 | 2 | 6.50 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 11.50 | 2.5 | 0.00 | 2023-01-01 | 0 | NaN |
| 4 | 2 | 2023-01-01 00:43:00 | 2023-01-01 01:01:00 | NaN | 19.24 | NaN | None | 66 | 107 | 0 | 25.64 | 0.0 | 0.5 | 5.93 | 0.0 | 1.0 | 35.57 | NaN | NaN | 2023-01-01 | 0 | NaN |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 265487 entries, 0 to 265486 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 VendorID 265487 non-null int64 1 tpep_pickup_datetime 265487 non-null datetime64[us] 2 tpep_dropoff_datetime 265487 non-null datetime64[us] 3 passenger_count 256656 non-null float64 4 trip_distance 265487 non-null float64 5 RatecodeID 256656 non-null float64 6 store_and_fwd_flag 256656 non-null object 7 PULocationID 265487 non-null int64 8 DOLocationID 265487 non-null int64 9 payment_type 265487 non-null int64 10 fare_amount 265487 non-null float64 11 extra 265487 non-null float64 12 mta_tax 265487 non-null float64 13 tip_amount 265487 non-null float64 14 tolls_amount 265487 non-null float64 15 improvement_surcharge 265487 non-null float64 16 total_amount 265487 non-null float64 17 congestion_surcharge 256656 non-null float64 18 airport_fee 20782 non-null float64 19 date 265487 non-null object 20 hour 265487 non-null int32 21 Airport_fee 235874 non-null float64 dtypes: datetime64[us](2), float64(13), int32(1), int64(4), object(2) memory usage: 43.5+ MB
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | date | hour | Airport_fee | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:07:18 | 2023-01-01 00:23:15 | 1.0 | 7.74 | 1.0 | N | 138 | 256 | 2 | 32.40 | 6.0 | 0.5 | 0.00 | 0.0 | 1.0 | 41.15 | 0.0 | 1.25 | 2023-01-01 | 0 | NaN |
| 1 | 2 | 2023-01-01 00:16:41 | 2023-01-01 00:21:46 | 2.0 | 1.24 | 1.0 | N | 161 | 237 | 1 | 7.90 | 1.0 | 0.5 | 2.58 | 0.0 | 1.0 | 15.48 | 2.5 | 0.00 | 2023-01-01 | 0 | NaN |
| 2 | 2 | 2023-01-01 00:14:03 | 2023-01-01 00:24:36 | 3.0 | 1.44 | 1.0 | N | 237 | 141 | 2 | 11.40 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 16.40 | 2.5 | 0.00 | 2023-01-01 | 0 | NaN |
| 3 | 2 | 2023-01-01 00:24:30 | 2023-01-01 00:29:55 | 1.0 | 0.54 | 1.0 | N | 143 | 142 | 2 | 6.50 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 11.50 | 2.5 | 0.00 | 2023-01-01 | 0 | NaN |
| 4 | 2 | 2023-01-01 00:43:00 | 2023-01-01 01:01:00 | NaN | 19.24 | NaN | None | 66 | 107 | 0 | 25.64 | 0.0 | 0.5 | 5.93 | 0.0 | 1.0 | 35.57 | NaN | NaN | 2023-01-01 | 0 | NaN |
Comparison of airport fee columns: airport_fee unique values: [1.25 0. nan] Airport_fee unique values: [ nan 0. 1.75 -1.75 1.25] Count of non-null values: airport_fee: 20782 Airport_fee: 235874
Are values same where both columns have data: True
Observations
airport_fee has fewer non-null values (20,782) but simpler value range
Airport_fee has more non-null values (235,874) but includes additional values (1.75, -1.7)
Where both columns have values, they match (comparison.all() is True)
2.1.1 [2 marks]
Fix the index and drop unnecessary columns
2.1.2 [3 marks]
There are two airport fee columns. This is possibly an error in naming columns. Let's see whether these can be combined into a single column.
Updated Dataset Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 265487 entries, 0 to 265486 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 VendorID 265487 non-null int64 1 tpep_pickup_datetime 265487 non-null datetime64[us] 2 tpep_dropoff_datetime 265487 non-null datetime64[us] 3 passenger_count 256656 non-null float64 4 trip_distance 265487 non-null float64 5 RatecodeID 256656 non-null float64 6 PULocationID 265487 non-null int64 7 DOLocationID 265487 non-null int64 8 payment_type 265487 non-null int64 9 fare_amount 265487 non-null float64 10 extra 265487 non-null float64 11 mta_tax 265487 non-null float64 12 tip_amount 265487 non-null float64 13 tolls_amount 265487 non-null float64 14 improvement_surcharge 265487 non-null float64 15 total_amount 265487 non-null float64 16 congestion_surcharge 256656 non-null float64 17 airport_fee 256656 non-null float64 dtypes: datetime64[us](2), float64(12), int64(4) memory usage: 36.5 MB None Unique values in airport_fee after combining: airport_fee 0.00 234050 1.75 16911 NaN 8831 1.25 5693 -1.75 2 Name: count, dtype: int64 First few rows: VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \ 0 2 2023-01-01 00:07:18 2023-01-01 00:23:15 1.0 1 2 2023-01-01 00:16:41 2023-01-01 00:21:46 2.0 2 2 2023-01-01 00:14:03 2023-01-01 00:24:36 3.0 3 2 2023-01-01 00:24:30 2023-01-01 00:29:55 1.0 4 2 2023-01-01 00:43:00 2023-01-01 01:01:00 NaN trip_distance RatecodeID PULocationID DOLocationID payment_type \ 0 7.74 1.0 138 256 2 1 1.24 1.0 161 237 1 2 1.44 1.0 237 141 2 3 0.54 1.0 143 142 2 4 19.24 NaN 66 107 0 fare_amount extra mta_tax tip_amount tolls_amount \ 0 32.40 6.0 0.5 0.00 0.0 1 7.90 1.0 0.5 2.58 0.0 2 11.40 1.0 0.5 0.00 0.0 3 6.50 1.0 0.5 0.00 0.0 4 25.64 0.0 0.5 5.93 0.0 improvement_surcharge total_amount congestion_surcharge airport_fee 0 1.0 41.15 0.0 1.25 1 1.0 15.48 2.5 0.00 2 1.0 16.40 2.5 0.00 3 1.0 11.50 2.5 0.00 4 1.0 35.57 NaN NaN
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:07:18 | 2023-01-01 00:23:15 | 1.0 | 7.74 | 1.0 | 138 | 256 | 2 | 32.40 | 6.00 | 0.5 | 0.00 | 0.00 | 1.0 | 41.15 | 0.0 | 1.25 |
| 1 | 2 | 2023-01-01 00:16:41 | 2023-01-01 00:21:46 | 2.0 | 1.24 | 1.0 | 161 | 237 | 1 | 7.90 | 1.00 | 0.5 | 2.58 | 0.00 | 1.0 | 15.48 | 2.5 | 0.00 |
| 2 | 2 | 2023-01-01 00:14:03 | 2023-01-01 00:24:36 | 3.0 | 1.44 | 1.0 | 237 | 141 | 2 | 11.40 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 16.40 | 2.5 | 0.00 |
| 3 | 2 | 2023-01-01 00:24:30 | 2023-01-01 00:29:55 | 1.0 | 0.54 | 1.0 | 143 | 142 | 2 | 6.50 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 11.50 | 2.5 | 0.00 |
| 4 | 2 | 2023-01-01 00:43:00 | 2023-01-01 01:01:00 | NaN | 19.24 | NaN | 66 | 107 | 0 | 25.64 | 0.00 | 0.5 | 5.93 | 0.00 | 1.0 | 35.57 | NaN | NaN |
| 5 | 1 | 2023-01-01 00:42:56 | 2023-01-01 01:16:33 | 2.0 | 7.10 | 1.0 | 246 | 37 | 1 | 34.50 | 3.50 | 0.5 | 7.90 | 0.00 | 1.0 | 47.40 | 2.5 | 0.00 |
| 6 | 2 | 2023-01-01 00:58:00 | 2023-01-01 01:08:31 | 2.0 | 1.59 | 1.0 | 79 | 164 | 1 | 11.40 | 1.00 | 0.5 | 3.28 | 0.00 | 1.0 | 19.68 | 2.5 | 0.00 |
| 7 | 2 | 2023-01-01 00:16:06 | 2023-01-01 00:31:59 | 1.0 | 3.16 | 1.0 | 79 | 256 | 1 | 19.10 | 1.00 | 0.5 | 6.02 | 0.00 | 1.0 | 30.12 | 2.5 | 0.00 |
| 8 | 2 | 2023-01-01 00:44:09 | 2023-01-01 01:01:17 | 1.0 | 7.64 | 1.0 | 132 | 95 | 1 | 31.70 | 1.00 | 0.5 | 7.09 | 0.00 | 1.0 | 42.54 | 0.0 | 1.25 |
| 9 | 1 | 2023-01-01 00:15:25 | 2023-01-01 00:22:35 | 2.0 | 0.90 | 1.0 | 148 | 114 | 1 | 7.90 | 3.50 | 0.5 | 1.00 | 0.00 | 1.0 | 13.90 | 2.5 | 0.00 |
| 10 | 2 | 2023-01-01 00:14:47 | 2023-01-01 00:20:18 | 1.0 | 0.78 | 1.0 | 237 | 229 | 2 | 7.20 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 12.20 | 2.5 | 0.00 |
| 11 | 2 | 2023-01-01 00:24:48 | 2023-01-01 00:45:04 | 1.0 | 2.78 | 1.0 | 79 | 186 | 1 | 19.80 | 1.00 | 0.5 | 4.96 | 0.00 | 1.0 | 29.76 | 2.5 | 0.00 |
| 12 | 2 | 2023-01-01 00:27:33 | 2023-01-01 00:36:38 | 2.0 | 0.85 | 1.0 | 90 | 68 | 1 | 10.00 | 1.00 | 0.5 | 3.75 | 0.00 | 1.0 | 18.75 | 2.5 | 0.00 |
| 13 | 2 | 2023-01-01 00:13:02 | 2023-01-01 00:38:02 | 1.0 | 4.37 | 1.0 | 142 | 148 | 1 | 25.40 | 1.00 | 0.5 | 9.12 | 0.00 | 1.0 | 39.52 | 2.5 | 0.00 |
| 14 | 2 | 2023-01-01 00:06:39 | 2023-01-01 00:25:25 | 1.0 | 2.43 | 1.0 | 144 | 186 | 1 | 18.40 | 1.00 | 0.5 | 4.68 | 0.00 | 1.0 | 28.08 | 2.5 | 0.00 |
| 15 | 2 | 2023-01-01 00:41:50 | 2023-01-01 01:14:50 | NaN | 10.77 | NaN | 151 | 106 | 0 | 45.38 | 0.00 | 0.5 | 11.19 | 6.55 | 1.0 | 67.12 | NaN | NaN |
| 16 | 2 | 2023-01-01 00:56:42 | 2023-01-01 01:00:25 | 1.0 | 0.74 | 1.0 | 229 | 141 | 1 | 5.80 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 10.80 | 2.5 | 0.00 |
| 17 | 2 | 2023-01-01 00:56:31 | 2023-01-01 01:07:51 | 2.0 | 1.73 | 1.0 | 144 | 113 | 1 | 12.10 | 1.00 | 0.5 | 2.05 | 0.00 | 1.0 | 19.15 | 2.5 | 0.00 |
| 18 | 2 | 2023-01-01 00:21:44 | 2023-01-01 00:36:40 | 1.0 | 2.95 | 1.0 | 164 | 236 | 1 | 17.70 | 1.00 | 0.5 | 5.68 | 0.00 | 1.0 | 28.38 | 2.5 | 0.00 |
| 19 | 1 | 2023-01-01 00:13:56 | 2023-01-01 00:23:05 | 4.0 | 0.70 | 1.0 | 170 | 170 | 1 | 7.90 | 3.50 | 0.5 | 2.55 | 0.00 | 1.0 | 15.45 | 2.5 | 0.00 |
| 20 | 1 | 2023-01-01 00:58:43 | 2023-01-01 01:06:07 | 2.0 | 1.70 | 1.0 | 107 | 229 | 1 | 9.30 | 3.50 | 0.5 | 1.00 | 0.00 | 1.0 | 15.30 | 2.5 | 0.00 |
| 21 | 2 | 2023-01-01 00:52:35 | 2023-01-01 01:14:24 | 1.0 | 1.83 | 1.0 | 164 | 246 | 1 | 18.40 | 1.00 | 0.5 | 2.70 | 0.00 | 1.0 | 26.10 | 2.5 | 0.00 |
| 22 | 2 | 2023-01-01 00:47:20 | 2023-01-01 01:31:34 | 6.0 | 4.69 | 1.0 | 90 | 238 | 2 | 40.10 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 45.10 | 2.5 | 0.00 |
| 23 | 2 | 2023-01-01 00:13:02 | 2023-01-01 00:26:20 | 1.0 | 1.46 | 1.0 | 158 | 107 | 1 | 13.50 | 1.00 | 0.5 | 1.00 | 0.00 | 1.0 | 19.50 | 2.5 | 0.00 |
| 24 | 1 | 2023-01-01 00:18:33 | 2023-01-01 00:54:44 | 1.0 | 21.00 | 2.0 | 132 | 238 | 1 | 70.00 | 3.75 | 0.5 | 0.00 | 6.55 | 1.0 | 81.80 | 2.5 | 1.25 |
| 25 | 2 | 2023-01-01 00:58:15 | 2023-01-01 01:12:43 | 1.0 | 2.48 | 1.0 | 264 | 263 | 1 | 15.60 | 1.00 | 0.5 | 3.62 | 0.00 | 1.0 | 21.72 | 0.0 | 0.00 |
| 26 | 2 | 2023-01-01 00:24:12 | 2023-01-01 00:48:09 | 2.0 | 7.07 | 1.0 | 144 | 7 | 1 | 32.40 | 1.00 | 0.5 | 9.35 | 0.00 | 1.0 | 46.75 | 2.5 | 0.00 |
| 27 | 2 | 2023-01-01 00:41:09 | 2023-01-01 01:00:39 | 3.0 | 4.90 | 1.0 | 107 | 238 | 1 | 23.30 | 1.00 | 0.5 | 5.66 | 0.00 | 1.0 | 33.96 | 2.5 | 0.00 |
| 28 | 2 | 2023-01-01 00:35:20 | 2023-01-01 00:47:11 | 1.0 | 6.86 | 1.0 | 132 | 222 | 2 | 28.20 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 30.70 | 0.0 | 0.00 |
| 29 | 2 | 2023-01-01 00:28:15 | 2023-01-01 00:42:35 | 5.0 | 3.22 | 1.0 | 261 | 186 | 2 | 17.00 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 22.00 | 2.5 | 0.00 |
| 30 | 1 | 2023-01-01 00:04:34 | 2023-01-01 00:17:32 | 1.0 | 1.60 | 1.0 | 125 | 186 | 1 | 12.80 | 3.50 | 0.5 | 3.00 | 0.00 | 1.0 | 20.80 | 2.5 | 0.00 |
| 31 | 2 | 2023-01-01 00:41:22 | 2023-01-01 00:47:01 | 1.0 | 0.78 | 1.0 | 164 | 234 | 1 | 7.20 | 1.00 | 0.5 | 2.44 | 0.00 | 1.0 | 14.64 | 2.5 | 0.00 |
| 32 | 2 | 2023-01-01 00:13:04 | 2023-01-01 00:22:10 | 1.0 | 1.52 | 1.0 | 79 | 186 | 1 | 10.00 | 1.00 | 0.5 | 1.25 | 0.00 | 1.0 | 16.25 | 2.5 | 0.00 |
| 33 | 2 | 2023-01-01 00:06:59 | 2023-01-01 00:32:48 | 2.0 | 9.61 | 1.0 | 234 | 119 | 2 | 40.10 | 1.00 | 0.5 | 0.00 | 0.00 | 1.0 | 45.10 | 2.5 | 0.00 |
| 34 | 1 | 2023-01-01 00:30:47 | 2023-01-01 01:01:43 | 3.0 | 3.90 | 1.0 | 48 | 263 | 1 | 28.20 | 3.50 | 0.5 | 4.00 | 0.00 | 1.0 | 37.20 | 2.5 | 0.00 |
| 35 | 1 | 2023-01-01 00:16:59 | 2023-01-01 00:22:09 | 1.0 | 0.80 | 1.0 | 162 | 170 | 2 | 6.50 | 3.50 | 0.5 | 0.00 | 0.00 | 1.0 | 11.50 | 2.5 | 0.00 |
| 36 | 1 | 2023-01-01 00:43:36 | 2023-01-01 00:54:58 | 1.0 | 2.40 | 1.0 | 141 | 238 | 1 | 14.20 | 3.50 | 0.5 | 3.80 | 0.00 | 1.0 | 23.00 | 2.5 | 0.00 |
| 37 | 2 | 2023-01-01 01:22:58 | 2023-01-01 01:47:54 | 1.0 | 3.64 | 1.0 | 144 | 163 | 1 | 24.00 | 1.00 | 0.5 | 5.80 | 0.00 | 1.0 | 34.80 | 2.5 | 0.00 |
| 38 | 2 | 2023-01-01 01:48:45 | 2023-01-01 01:56:29 | 2.0 | 0.98 | 1.0 | 162 | 164 | 1 | 9.30 | 1.00 | 0.5 | 0.02 | 0.00 | 1.0 | 14.32 | 2.5 | 0.00 |
| 39 | 2 | 2023-01-01 01:03:47 | 2023-01-01 01:59:39 | 2.0 | 10.61 | 1.0 | 161 | 177 | 1 | 58.30 | 1.00 | 0.5 | 12.66 | 0.00 | 1.0 | 75.96 | 2.5 | 0.00 |
2.1.4 [5 marks]
Fix columns with negative (monetary) values
Number of negative values in monetary columns: extra: 1 negative values Sample of negative values: 42134 -2.5 Name: extra, dtype: float64 -------------------------------------------------- mta_tax: 11 negative values Sample of negative values: 47880 -0.5 50486 -0.5 63978 -0.5 90839 -0.5 94954 -0.5 Name: mta_tax, dtype: float64 -------------------------------------------------- improvement_surcharge: 11 negative values Sample of negative values: 47880 -1.0 50486 -1.0 63978 -1.0 90839 -1.0 94954 -1.0 Name: improvement_surcharge, dtype: float64 -------------------------------------------------- total_amount: 11 negative values Sample of negative values: 47880 -4.00 50486 -1.50 63978 -3.25 90839 -4.00 94954 -1.50 Name: total_amount, dtype: float64 -------------------------------------------------- congestion_surcharge: 6 negative values Sample of negative values: 47880 -2.5 90839 -2.5 134619 -2.5 138638 -2.5 139776 -2.5 Name: congestion_surcharge, dtype: float64 -------------------------------------------------- airport_fee: 2 negative values Sample of negative values: 63978 -1.75 203964 -1.75 Name: airport_fee, dtype: float64 --------------------------------------------------
Did you notice something different in the RatecodeID column for above records?
Unique values in RatecodeID: RatecodeID 1.0 242253 2.0 10081 99.0 1530 5.0 1428 3.0 845 4.0 519 Name: count, dtype: int64 Number of NaN values: 8831
Sample records with RatecodeID = 99:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \
231 1 2023-01-01 11:39:40 2023-01-01 11:55:59 1.0
698 1 2023-01-02 13:09:05 2023-01-02 13:25:32 1.0
1238 1 2023-01-03 13:07:10 2023-01-03 13:42:25 1.0
1305 1 2023-01-03 15:30:26 2023-01-03 16:39:39 1.0
1646 1 2023-01-04 07:35:07 2023-01-04 07:56:45 1.0
trip_distance RatecodeID PULocationID DOLocationID payment_type \
231 2.9 99.0 42 47 1
698 6.9 99.0 137 42 1
1238 0.0 99.0 14 216 1
1305 13.0 99.0 141 35 1
1646 3.3 99.0 218 130 1
fare_amount extra mta_tax tip_amount tolls_amount \
231 22.5 0.0 0.5 0.0 0.00
698 31.5 0.0 0.5 0.0 0.00
1238 46.5 0.0 0.5 0.0 0.00
1305 55.5 0.0 0.5 0.0 6.55
1646 20.5 0.0 0.5 0.0 0.00
improvement_surcharge total_amount congestion_surcharge airport_fee
231 1.0 24.00 0.0 0.0
698 1.0 33.00 0.0 0.0
1238 1.0 48.00 0.0 0.0
1305 1.0 63.55 0.0 0.0
1646 1.0 22.00 0.0 0.0
Before fix - RatecodeID unique values: RatecodeID 1.0 242253 2.0 10081 NaN 8831 99.0 1530 5.0 1428 3.0 845 4.0 519 Name: count, dtype: int64 After fix - RatecodeID unique values: RatecodeID 1 251084 2 10081 99 1530 5 1428 3 845 4 519 Name: count, dtype: int64 Number of negative values in monetary columns: extra: 1 negative values mta_tax: 11 negative values improvement_surcharge: 11 negative values total_amount: 11 negative values congestion_surcharge: 6 negative values airport_fee: 2 negative values
After cleaning RatecodeID (including 99): RatecodeID 1 252614 2 10081 5 1428 3 845 4 519 Name: count, dtype: int64
Sample records with negative values in monetary columns:
Negative values in extra:
extra total_amount fare_amount
42134 -2.5 4.0 3.0
Negative values in mta_tax:
mta_tax total_amount fare_amount
47880 -0.5 -4.00 0.0
50486 -0.5 -1.50 0.0
63978 -0.5 -3.25 0.0
90839 -0.5 -4.00 0.0
94954 -0.5 -1.50 0.0
Negative values in improvement_surcharge:
improvement_surcharge total_amount fare_amount
47880 -1.0 -4.00 0.0
50486 -1.0 -1.50 0.0
63978 -1.0 -3.25 0.0
90839 -1.0 -4.00 0.0
94954 -1.0 -1.50 0.0
Negative values in total_amount:
total_amount total_amount fare_amount
47880 -4.00 -4.00 0.0
50486 -1.50 -1.50 0.0
63978 -3.25 -3.25 0.0
90839 -4.00 -4.00 0.0
94954 -1.50 -1.50 0.0
Negative values in congestion_surcharge:
congestion_surcharge total_amount fare_amount
47880 -2.5 -4.0 0.0
90839 -2.5 -4.0 0.0
134619 -2.5 -4.0 0.0
138638 -2.5 -4.0 0.0
139776 -2.5 -4.0 0.0
Negative values in airport_fee:
airport_fee total_amount fare_amount
63978 -1.75 -3.25 0.0
203964 -1.75 -3.25 0.0
These look like refunds or charge reversals rather than errors. Let's handle them appropriately:
Number of refund transactions: 11
Summary of refund transactions:
fare_amount extra mta_tax tip_amount tolls_amount \
count 11.0 11.0 11.0 11.0 11.0
mean 0.0 0.0 -0.5 0.0 0.0
std 0.0 0.0 0.0 0.0 0.0
min 0.0 0.0 -0.5 0.0 0.0
25% 0.0 0.0 -0.5 0.0 0.0
50% 0.0 0.0 -0.5 0.0 0.0
75% 0.0 0.0 -0.5 0.0 0.0
max 0.0 0.0 -0.5 0.0 0.0
improvement_surcharge total_amount congestion_surcharge airport_fee
count 11.0 11.000000 11.000000 11.000000
mean -1.0 -3.181818 -1.363636 -0.318182
std 0.0 1.118542 1.305582 0.707910
min -1.0 -4.000000 -2.500000 -1.750000
25% -1.0 -4.000000 -2.500000 0.000000
50% -1.0 -4.000000 -2.500000 0.000000
75% -1.0 -2.375000 0.000000 0.000000
max -1.0 -1.500000 0.000000 0.000000
Number of refund transactions flagged: 11
Final check - Number of negative values in each column: extra: 1 negative values mta_tax: 11 negative values improvement_surcharge: 11 negative values total_amount: 11 negative values congestion_surcharge: 6 negative values airport_fee: 2 negative values Total number of refund transactions: 11
2.2 Handling Missing Values¶
[10 marks]
2.2.1 [2 marks]
Find the proportion of missing values in each column
Proportion of missing values in each column: Column Name Missing % --------------------------------------------- airport_fee 3.33% passenger_count 3.33% congestion_surcharge 3.33% Total number of rows in dataset: 265487
Missing value analysis:
---------------------------------------------
airport_fee 8831
passenger_count 8831
congestion_surcharge 8831
overlapping 8831
Sample of rows with missing values:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \
4 2 2023-01-01 00:43:00 2023-01-01 01:01:00 NaN
15 2 2023-01-01 00:41:50 2023-01-01 01:14:50 NaN
40 2 2023-01-01 01:46:00 2023-01-01 02:15:00 NaN
41 2 2023-01-01 01:10:00 2023-01-01 01:41:00 NaN
56 2 2023-01-01 01:18:30 2023-01-01 01:46:19 NaN
69 2 2023-01-01 01:03:24 2023-01-01 01:17:10 NaN
84 1 2023-01-01 02:51:58 2023-01-01 03:00:53 NaN
93 2 2023-01-01 02:12:18 2023-01-01 02:42:53 NaN
97 2 2023-01-01 02:10:40 2023-01-01 02:39:28 NaN
119 2 2023-01-01 03:03:00 2023-01-01 03:30:00 NaN
134 2 2023-01-01 03:22:00 2023-01-01 03:57:00 NaN
135 2 2023-01-01 03:33:56 2023-01-01 04:08:08 NaN
142 2 2023-01-01 04:48:00 2023-01-01 05:06:00 NaN
155 2 2023-01-01 04:30:57 2023-01-01 04:44:37 NaN
167 2 2023-01-01 06:39:00 2023-01-01 06:43:00 NaN
168 2 2023-01-01 06:54:00 2023-01-01 07:11:00 NaN
176 2 2023-01-01 08:22:57 2023-01-01 08:36:19 NaN
217 2 2023-01-01 11:13:00 2023-01-01 11:31:00 NaN
221 1 2023-01-01 11:18:41 2023-01-01 11:35:48 NaN
232 1 2023-01-01 11:17:24 2023-01-01 11:22:59 NaN
trip_distance RatecodeID PULocationID DOLocationID payment_type \
4 19.24 1 66 107 0
15 10.77 1 151 106 0
40 4.97 1 50 226 0
41 5.73 1 186 255 0
56 2.98 1 137 48 0
69 2.37 1 50 141 0
84 2.10 1 162 236 0
93 9.79 1 189 262 0
97 12.96 1 106 42 0
119 2.24 1 37 141 0
134 11.34 1 209 138 0
135 8.82 1 61 140 0
142 6.88 1 68 42 0
155 3.51 1 107 142 0
167 0.71 1 107 170 0
168 10.81 1 233 138 0
176 4.65 1 237 232 0
217 3.77 1 236 69 0
221 9.30 1 238 261 0
232 1.50 1 142 238 0
fare_amount extra mta_tax tip_amount tolls_amount \
4 25.64 0.0 0.5 5.93 0.00
15 45.38 0.0 0.5 11.19 6.55
40 27.62 0.0 0.5 6.32 0.00
41 27.32 0.0 0.5 6.26 0.00
56 27.54 0.0 0.5 3.15 0.00
69 17.32 0.0 0.5 3.20 0.00
84 11.40 1.0 0.5 1.64 0.00
93 41.04 0.0 0.5 2.00 0.00
97 50.06 0.0 0.5 10.81 0.00
119 31.29 0.0 0.5 1.76 0.00
134 44.37 0.0 0.5 9.67 0.00
135 44.44 0.0 0.5 9.69 0.00
142 30.64 0.0 0.5 3.46 0.00
155 18.58 0.0 0.5 4.52 0.00
167 21.63 0.0 0.5 5.13 0.00
168 38.54 0.0 0.5 0.00 6.55
176 25.24 0.0 0.5 5.85 0.00
217 31.85 0.0 0.5 0.00 0.00
221 35.90 0.0 0.5 2.00 0.00
232 8.60 0.0 0.5 6.30 0.00
improvement_surcharge total_amount congestion_surcharge airport_fee \
4 1.0 35.57 NaN NaN
15 1.0 67.12 NaN NaN
40 1.0 37.94 NaN NaN
41 1.0 37.58 NaN NaN
56 1.0 34.69 NaN NaN
69 1.0 24.52 NaN NaN
84 1.0 18.04 NaN NaN
93 1.0 47.04 NaN NaN
97 1.0 64.87 NaN NaN
119 1.0 37.05 NaN NaN
134 1.0 58.04 NaN NaN
135 1.0 58.13 NaN NaN
142 1.0 38.10 NaN NaN
155 1.0 27.10 NaN NaN
167 1.0 30.76 NaN NaN
168 1.0 49.09 NaN NaN
176 1.0 35.09 NaN NaN
217 1.0 35.85 NaN NaN
221 1.0 41.90 NaN NaN
232 1.0 18.90 NaN NaN
is_refund
4 False
15 False
40 False
41 False
56 False
69 False
84 False
93 False
97 False
119 False
134 False
135 False
142 False
155 False
167 False
168 False
176 False
217 False
221 False
232 False
Missing values after fixing: Series([], dtype: int64) head(15)
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | is_refund | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:07:18 | 2023-01-01 00:23:15 | 1.0 | 7.74 | 1 | 138 | 256 | 2 | 32.40 | 6.0 | 0.5 | 0.00 | 0.0 | 1.0 | 41.15 | 0.0 | 1.25 | False |
| 1 | 2 | 2023-01-01 00:16:41 | 2023-01-01 00:21:46 | 2.0 | 1.24 | 1 | 161 | 237 | 1 | 7.90 | 1.0 | 0.5 | 2.58 | 0.0 | 1.0 | 15.48 | 2.5 | 0.00 | False |
| 2 | 2 | 2023-01-01 00:14:03 | 2023-01-01 00:24:36 | 3.0 | 1.44 | 1 | 237 | 141 | 2 | 11.40 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 16.40 | 2.5 | 0.00 | False |
| 3 | 2 | 2023-01-01 00:24:30 | 2023-01-01 00:29:55 | 1.0 | 0.54 | 1 | 143 | 142 | 2 | 6.50 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 11.50 | 2.5 | 0.00 | False |
| 4 | 2 | 2023-01-01 00:43:00 | 2023-01-01 01:01:00 | 1.0 | 19.24 | 1 | 66 | 107 | 1 | 25.64 | 0.0 | 0.5 | 5.93 | 0.0 | 1.0 | 35.57 | 0.0 | 0.00 | False |
| 5 | 1 | 2023-01-01 00:42:56 | 2023-01-01 01:16:33 | 2.0 | 7.10 | 1 | 246 | 37 | 1 | 34.50 | 3.5 | 0.5 | 7.90 | 0.0 | 1.0 | 47.40 | 2.5 | 0.00 | False |
| 6 | 2 | 2023-01-01 00:58:00 | 2023-01-01 01:08:31 | 2.0 | 1.59 | 1 | 79 | 164 | 1 | 11.40 | 1.0 | 0.5 | 3.28 | 0.0 | 1.0 | 19.68 | 2.5 | 0.00 | False |
| 7 | 2 | 2023-01-01 00:16:06 | 2023-01-01 00:31:59 | 1.0 | 3.16 | 1 | 79 | 256 | 1 | 19.10 | 1.0 | 0.5 | 6.02 | 0.0 | 1.0 | 30.12 | 2.5 | 0.00 | False |
| 8 | 2 | 2023-01-01 00:44:09 | 2023-01-01 01:01:17 | 1.0 | 7.64 | 1 | 132 | 95 | 1 | 31.70 | 1.0 | 0.5 | 7.09 | 0.0 | 1.0 | 42.54 | 0.0 | 1.25 | False |
| 9 | 1 | 2023-01-01 00:15:25 | 2023-01-01 00:22:35 | 2.0 | 0.90 | 1 | 148 | 114 | 1 | 7.90 | 3.5 | 0.5 | 1.00 | 0.0 | 1.0 | 13.90 | 2.5 | 0.00 | False |
| 10 | 2 | 2023-01-01 00:14:47 | 2023-01-01 00:20:18 | 1.0 | 0.78 | 1 | 237 | 229 | 2 | 7.20 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 12.20 | 2.5 | 0.00 | False |
| 11 | 2 | 2023-01-01 00:24:48 | 2023-01-01 00:45:04 | 1.0 | 2.78 | 1 | 79 | 186 | 1 | 19.80 | 1.0 | 0.5 | 4.96 | 0.0 | 1.0 | 29.76 | 2.5 | 0.00 | False |
| 12 | 2 | 2023-01-01 00:27:33 | 2023-01-01 00:36:38 | 2.0 | 0.85 | 1 | 90 | 68 | 1 | 10.00 | 1.0 | 0.5 | 3.75 | 0.0 | 1.0 | 18.75 | 2.5 | 0.00 | False |
| 13 | 2 | 2023-01-01 00:13:02 | 2023-01-01 00:38:02 | 1.0 | 4.37 | 1 | 142 | 148 | 1 | 25.40 | 1.0 | 0.5 | 9.12 | 0.0 | 1.0 | 39.52 | 2.5 | 0.00 | False |
| 14 | 2 | 2023-01-01 00:06:39 | 2023-01-01 00:25:25 | 1.0 | 2.43 | 1 | 144 | 186 | 1 | 18.40 | 1.0 | 0.5 | 4.68 | 0.0 | 1.0 | 28.08 | 2.5 | 0.00 | False |
2.2.2 [3 marks]
Handling missing values in passenger_count
Did you find zeroes in passenger_count? Handle these.
2.2.3 [2 marks]
Handle missing values in RatecodeID
Missing values for passenger_count and RatecodeID have been already handled, lets verify to be sure
Current state of columns:
passenger_count unique values:
passenger_count
1.0 202229
2.0 38669
3.0 9640
4.0 5340
0.0 4070
5.0 3300
6.0 2234
8.0 2
7.0 2
9.0 1
Name: count, dtype: int64
RatecodeID unique values:
RatecodeID
1 252614
2 10081
5 1428
3 845
4 519
Name: count, dtype: int64
congestion_surcharge unique values:
congestion_surcharge
2.5 236938
0.0 28543
-2.5 6
Name: count, dtype: int64
Remaining missing values:
passenger_count 0
RatecodeID 0
congestion_surcharge 0
dtype: int64
Summary statistics:
passenger_count RatecodeID congestion_surcharge
count 265487.000000 265487.000000 265487.000000
mean 1.355208 1.071717 2.231107
std 0.878186 0.386284 0.774734
min 0.000000 1.000000 -2.500000
25% 1.000000 1.000000 2.500000
50% 1.000000 1.000000 2.500000
75% 1.000000 1.000000 2.500000
max 9.000000 5.000000 2.500000
2.2.4 [3 marks]
Impute NaN in congestion_surcharge
Are there missing values in other columns? Did you find NaN values in some other set of columns? Handle those missing values below.
After cleaning - Passenger count values: passenger_count 1 206304 2 38669 3 9640 4 5340 5 3300 6 2234 Name: count, dtype: int64 After cleaning - Congestion surcharge values: congestion_surcharge -2.5 6 0.0 28543 2.5 236938 Name: count, dtype: int64
2.3 Handling Outliers¶
[10 marks]
Before we start fixing outliers, let's perform outlier analysis.
Outlier Analysis for trip_distance: -------------------------------------------------- Q1: 1.05 Q3: 3.40 IQR: 2.35 Lower bound: -2.47 Upper bound: 6.92 Number of outliers: 34713 Percentage of outliers: 13.08% Example outliers: 0 7.74 4 19.24 5 7.10 8 7.64 15 10.77 Name: trip_distance, dtype: float64 Outlier Analysis for fare_amount: -------------------------------------------------- Q1: 9.30 Q3: 21.90 IQR: 12.60 Lower bound: -9.60 Upper bound: 40.80 Number of outliers: 27440 Percentage of outliers: 10.34% Example outliers: 15 45.38 24 70.00 39 58.30 49 45.70 52 50.00 Name: fare_amount, dtype: float64 Outlier Analysis for total_amount: -------------------------------------------------- Q1: 15.96 Q3: 30.80 IQR: 14.84 Lower bound: -6.30 Upper bound: 53.06 Number of outliers: 30566 Percentage of outliers: 11.51% Example outliers: 15 67.12 24 81.80 39 75.96 49 61.00 52 61.20 Name: total_amount, dtype: float64 Outlier Analysis for tip_amount: -------------------------------------------------- Q1: 1.00 Q3: 4.42 IQR: 3.42 Lower bound: -4.13 Upper bound: 9.55 Number of outliers: 20376 Percentage of outliers: 7.67% Example outliers: 15 11.19 39 12.66 52 10.20 67 14.20 71 52.00 Name: tip_amount, dtype: float64
2.3.1 [10 marks]
Based on the above analysis, it seems that some of the outliers are present due to errors in registering the trips. Fix the outliers.
Some points you can look for:
- Entries where
trip_distanceis nearly 0 andfare_amountis more than 300 - Entries where
trip_distanceandfare_amountare 0 but the pickup and dropoff zones are different (both distance and fare should not be zero for different zones) - Entries where
trip_distanceis more than 250 miles. - Entries where
payment_typeis 0 (there is no payment_type 0 defined in the data dictionary)
These are just some suggestions. You can handle outliers in any way you wish, using the insights from above outlier analysis.
How will you fix each of these values? Which ones will you drop and which ones will you replace?
First, let us remove 7+ passenger counts as there are very less instances.
Analyzing suspicious trips... Suspicious trip counts: Zero distance, non-zero fare: 5226 Very long trips (>100 miles): 5 Very high fares (>$300): 23 High fare per mile (>$50/mile): 869 High tip proportion (>50% of fare): 2770 Same location but non-zero distance: 11872
Summary after cleaning: Trip Distance Statistics: count 265487.000000 mean 3.544521 std 4.570818 min 0.000000 25% 1.100000 50% 1.840000 75% 3.520000 max 100.000000 Name: trip_distance, dtype: float64 Fare Amount Statistics: count 265487.000000 mean 19.734881 std 17.961350 min 0.000000 25% 9.300000 50% 13.500000 75% 21.900000 max 500.000000 Name: fare_amount, dtype: float64 Tip Amount Statistics: count 265487.000000 mean 3.483712 std 3.878590 min 0.000000 25% 1.000000 50% 2.800000 75% 4.400000 max 103.840000 Name: tip_amount, dtype: float64 Remaining suspicious cases: Zero distance, non-zero fare: 0 Very long trips (>100 miles): 0 High fare per mile (>$50/mile): 1
Final Summary Statistics: Trip Distance Statistics: count 265487.000000 mean 3.544521 std 4.570818 min 0.000000 25% 1.100000 50% 1.840000 75% 3.520000 max 100.000000 Name: trip_distance, dtype: float64 Fare Amount Statistics: count 265487.000000 mean 19.734881 std 17.961350 min 0.000000 25% 9.300000 50% 13.500000 75% 21.900000 max 500.000000 Name: fare_amount, dtype: float64 Tip Amount Statistics: count 265487.000000 mean 3.482470 std 3.859544 min 0.000000 25% 1.000000 50% 2.800000 75% 4.400000 max 50.000000 Name: tip_amount, dtype: float64 Fare per Mile Statistics: count 265487.000000 mean 7.843279 std 4.224128 min 0.000000 25% 5.515320 50% 7.099237 75% 8.977273 max 50.000000 Name: fare_per_mile, dtype: float64
Min Max Imagine you're a teacher grading papers:
One test is out of 500 points (like our fare_amount: $0 to $500)
Another test is out of 911 points (like our total_amount: -$4 to $911)
To make these scores comparable, you convert everything to a 0-100% scale.
Scaled fare_amount: -------------------------------------------------- count 265487.000 mean 0.039 std 0.036 min 0.000 25% 0.019 50% 0.027 75% 0.044 max 1.000 Original vs Scaled Range: Original: Min = 0.00, Max = 500.00 Scaled: Min = 0.00, Max = 1.00 -------------------------------------------------- Scaled total_amount: -------------------------------------------------- count 265487.000 mean 0.036 std 0.025 min 0.000 25% 0.022 50% 0.027 75% 0.038 max 1.000 Original vs Scaled Range: Original: Min = -4.00, Max = 911.00 Scaled: Min = 0.00, Max = 1.00 --------------------------------------------------
Current ranges of numerical columns: trip_distance: Min: 0.00 Max: 100.00 Mean: 3.54 Std: 4.57 tip_amount: Min: 0.00 Max: 50.00 Mean: 3.48 Std: 3.86 tolls_amount: Min: 0.00 Max: 143.00 Mean: 0.59 Std: 2.18 congestion_surcharge: Min: -2.50 Max: 2.50 Mean: 2.23 Std: 0.77 fare_amount_scaled: Min: 0.00 Max: 1.00 Mean: 0.04 Std: 0.04 total_amount_scaled: Min: 0.00 Max: 1.00 Mean: 0.04 Std: 0.02 Columns that might need standardization:
Initial data from parquet files: Number of records: 265487 After sampling (0.007 rate): Original total records: 3,041,714 Sampled records: 265,487 Sampling reduced data by: 91.27% Min Max reduced data by: 0.00%
3 Exploratory Data Analysis¶
[90 marks]
['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'airport_fee', 'is_refund', 'fare_amount_scaled', 'total_amount_scaled'] (265487, 21)
3.1 General EDA: Finding Patterns and Trends¶
[40 marks]
3.1.1 [3 marks]
Categorise the varaibles into Numerical or Categorical.
VendorID:tpep_pickup_datetime:tpep_dropoff_datetime:passenger_count:trip_distance:RatecodeID:PULocationID:DOLocationID:payment_type:pickup_hour:trip_duration:
The following monetary parameters belong in the same category, is it categorical or numerical?
fare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
Current Data Types in df_scaled: -------------------------------------------------- VendorID int64 tpep_pickup_datetime datetime64[us] tpep_dropoff_datetime datetime64[us] passenger_count int64 trip_distance float64 RatecodeID int64 PULocationID int64 DOLocationID int64 payment_type int64 fare_amount float64 extra float64 mta_tax float64 tip_amount float64 tolls_amount float64 improvement_surcharge float64 total_amount float64 congestion_surcharge float64 airport_fee float64 is_refund bool fare_amount_scaled float64 total_amount_scaled float64 Numerical Variables: -------------------------------------------------- VendorID int64 passenger_count int64 trip_distance float64 RatecodeID int64 PULocationID int64 DOLocationID int64 payment_type int64 fare_amount float64 extra float64 mta_tax float64 tip_amount float64 tolls_amount float64 improvement_surcharge float64 total_amount float64 congestion_surcharge float64 airport_fee float64 fare_amount_scaled float64 total_amount_scaled float64 Categorical Variables: -------------------------------------------------- tpep_pickup_datetime datetime64[us] tpep_dropoff_datetime datetime64[us] is_refund bool Numerical Variables Subcategories: -------------------------------------------------- Continuous Variables: trip_distance float64 fare_amount float64 extra float64 mta_tax float64 tip_amount float64 tolls_amount float64 improvement_surcharge float64 total_amount float64 congestion_surcharge float64 airport_fee float64 fare_amount_scaled float64 total_amount_scaled float64 Discrete Variables: VendorID int64 passenger_count int64 RatecodeID int64 PULocationID int64 DOLocationID int64 payment_type int64
Temporal Analysis¶
3.1.2 [5 marks]
Analyse the distribution of taxi pickups by hours, days of the week, and months.
Hourly Distribution of Pickups: -------------------------------------------------- hour 0 7508 1 5045 2 3321 3 2194 4 1502 5 1554 6 3673 7 7252 8 10055 9 11345 10 12307 11 13355 12 14507 13 14950 14 16023 15 16443 16 16461 17 17904 18 18746 19 16745 20 14928 21 14919 22 13810 23 10940 dtype: int64
Daily Distribution of Pickups: -------------------------------------------------- day_of_week Monday 33066 Tuesday 38385 Wednesday 40678 Thursday 41627 Friday 39511 Saturday 38672 Sunday 33548 dtype: int64
Monthly Distribution of Pickups: -------------------------------------------------- month January 21288 February 20208 March 23617 April 22810 May 24363 June 22939 July 20117 August 19546 September 19724 October 24408 November 23115 December 23352 dtype: int64
Financial Analysis¶
Take a look at the financial parameters like fare_amount, tip_amount, total_amount, and also trip_distance. Do these contain zero/negative values?
Zero/Negative Value Analysis: -------------------------------------------------- fare_amount: Zero values: 92 (0.03%) Negative values: 0 (0.00%) tip_amount: Zero values: 60984 (22.97%) Negative values: 0 (0.00%) total_amount: Zero values: 32 (0.01%) Negative values: 11 (0.00%) trip_distance: Zero values: 38 (0.01%) Negative values: 0 (0.00%)
Do you think it is beneficial to create a copy DataFrame leaving out the zero values from these?
Original vs Filtered Dataset: -------------------------------------------------- Original records: 265487 Records after removing zeros: 265395 Removed records: 92 (0.03%)
3.1.3 [2 marks]
Filter out the zero values from the above columns.
Note: The distance might be 0 in cases where pickup and drop is in the same zone. Do you think it is suitable to drop such cases of zero distance?
Analysis of Zero Distance Trips: -------------------------------------------------- Total zero distance trips: 38 Of these zero distance trips: Same pickup and dropoff location: 34 Different pickup and dropoff location: 4 Fare Statistics for Zero Distance Trips: count 38.0 mean 0.0 std 0.0 min 0.0 25% 0.0 50% 0.0 75% 0.0 max 0.0 Name: fare_amount, dtype: float64 Filtering Summary: -------------------------------------------------- Original records: 265487 Records after filtering: 265395 Removed records: 92 (0.03%) Statistics After Filtering: -------------------------------------------------- fare_amount: count 265395.000000 mean 19.741722 std 17.960703 min 0.010000 25% 9.300000 50% 13.500000 75% 21.900000 max 500.000000 Name: fare_amount, dtype: float64 tip_amount: count 265395.000000 mean 3.483677 std 3.859668 min 0.000000 25% 1.000000 50% 2.800000 75% 4.400000 max 50.000000 Name: tip_amount, dtype: float64 total_amount: count 265395.000000 mean 28.900014 std 22.767033 min 1.010000 25% 15.960000 50% 21.100000 75% 30.800000 max 911.000000 Name: total_amount, dtype: float64 trip_distance: count 265395.000000 mean 3.544425 std 4.568148 min 0.002000 25% 1.100000 50% 1.840000 75% 3.520000 max 100.000000 Name: trip_distance, dtype: float64
3.1.4 [3 marks]
Analyse the monthly revenue (total_amount) trend
Monthly Revenue Analysis:
--------------------------------------------------------------------------------
Total Revenue Average Fare Number of Trips \
tpep_pickup_datetime
January 584589.08 27.47 21280
February 554714.12 27.46 20203
March 665651.58 28.20 23608
April 652225.12 28.60 22803
May 716818.08 29.44 24350
June 676862.13 29.52 22930
July 580068.77 28.85 20108
August 567599.31 29.05 19539
September 598979.02 30.37 19720
October 728120.63 29.84 24403
November 667451.92 28.89 23104
December 676839.40 28.99 23347
Average Distance Revenue %
tpep_pickup_datetime
January 3.45 7.62
February 3.40 7.23
March 3.51 8.68
April 3.55 8.50
May 3.61 9.35
June 3.66 8.82
July 3.65 7.56
August 3.67 7.40
September 3.69 7.81
October 3.61 9.49
November 3.39 8.70
December 3.37 8.82
Key Insights:
--------------------------------------------------------------------------------
Highest Revenue Month: October
Lowest Revenue Month: February
Highest Average Fare Month: September
Lowest Average Fare Month: February
Busiest Month (Trip Count): October
Least Busy Month (Trip Count): August
3.1.5 [3 marks]
Show the proportion of each quarter of the year in the revenue
Quarterly Revenue Analysis:
--------------------------------------------------------------------------------
Total Revenue Average Fare Number of Trips Average Distance \
quarter
Q1 (Jan-Mar) 1804954.78 27.73 65091 3.45
Q2 (Apr-Jun) 2045905.33 29.19 70083 3.61
Q3 (Jul-Sep) 1746647.10 29.42 59367 3.67
Q4 (Oct-Dec) 2072411.95 29.25 70854 3.46
Revenue %
quarter
Q1 (Jan-Mar) 23.53
Q2 (Apr-Jun) 26.67
Q3 (Jul-Sep) 22.77
Q4 (Oct-Dec) 27.02
Quarterly Revenue Distribution:
--------------------------------------------------------------------------------
Q1 (Jan-Mar): $1,804,954.78 (23.53%)
Q2 (Apr-Jun): $2,045,905.33 (26.67%)
Q3 (Jul-Sep): $1,746,647.10 (22.77%)
Q4 (Oct-Dec): $2,072,411.95 (27.02%)
3.1.6 [3 marks]
Visualise the relationship between trip_distance and fare_amount. Also find the correlation value for these two.
Hint: You can leave out the trips with trip_distance = 0
Text(0, 0.5, 'Fare Amount ($)')
Trip Distance Distribution: -------------------------------------------------- Trips <= 30 miles: 265105 Trips > 30 miles: 290 Correlation coefficient: 0.9562611017212637
3.1.7 [5 marks]
Find and visualise the correlation between:
fare_amountand trip duration (pickup time to dropoff time)fare_amountandpassenger_counttip_amountandtrip_distance
Correlation Analysis: -------------------------------------------------- 1. Fare Amount vs Trip Duration: 0.2719 2. Fare Amount vs Passenger Count: 0.0411 3. Tip Amount vs Trip Distance: 0.6098 Summary Statistics: -------------------------------------------------- Trip Duration (minutes): count 265395.000000 mean 17.363668 std 40.268078 min -52.750000 25% 7.683333 50% 12.700000 75% 20.633333 max 5411.000000 Name: trip_duration, dtype: float64 Passenger Count: passenger_count 1 206221 2 38664 3 9637 4 5340 5 3300 6 2233 Name: count, dtype: int64
3.1.8 [3 marks]
Analyse the distribution of different payment types (payment_type)
Payment Type Distribution: -------------------------------------------------- Payment Method Count Percentage 0 Credit Card 217876 82.09 1 Cash 44419 16.74 2 No Charge 1201 0.45 3 Dispute 1899 0.72
Average Amounts by Payment Type:
--------------------------------------------------
fare_amount tip_amount total_amount
Credit Card 19.82 4.24 29.78
Cash 19.51 0.00 24.93
No Charge 15.68 0.00 21.16
Dispute 19.19 0.00 26.13
- 1= Credit card
- 2= Cash
- 3= No charge
- 4= Dispute
Geographical Analysis¶
For this, you have to use the taxi_zones.shp file from the taxi_zones folder.
There would be multiple files inside the folder (such as .shx, .sbx, .sbn etc). You do not need to import/read any of the files other than the shapefile, taxi_zones.shp.
Do not change any folder structure - all the files need to be present inside the folder for it to work.
The folder structure should look like this:
Taxi Zones
|- taxi_zones.shp.xml
|- taxi_zones.prj
|- taxi_zones.sbn
|- taxi_zones.shp
|- taxi_zones.dbf
|- taxi_zones.shx
|- taxi_zones.sbx
You only need to read the taxi_zones.shp file. The shp file will utilise the other files by itself.
We will use the GeoPandas library for geopgraphical analysis
import geopandas as gpd
More about geopandas and shapefiles: About
Reading the shapefile is very similar to Pandas. Use gpd.read_file() function to load the data (taxi_zones.shp) as a GeoDataFrame. Documentation: Reading and Writing Files
Requirement already satisfied: geopandas in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (1.0.1) Requirement already satisfied: numpy>=1.22 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from geopandas) (1.26.4) Requirement already satisfied: pyogrio>=0.7.2 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from geopandas) (0.10.0) Requirement already satisfied: packaging in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from geopandas) (23.1) Requirement already satisfied: pandas>=1.4.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from geopandas) (2.2.2) Requirement already satisfied: pyproj>=3.3.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from geopandas) (3.7.1) Requirement already satisfied: shapely>=2.0.0 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from geopandas) (2.0.7) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.4.0->geopandas) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.4.0->geopandas) (2024.2) Requirement already satisfied: tzdata>=2022.7 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.4.0->geopandas) (2024.2) Requirement already satisfied: certifi in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from pyogrio>=0.7.2->geopandas) (2022.12.7) Requirement already satisfied: six>=1.5 in c:\users\bvspa\appdata\local\programs\python\python311\lib\site-packages (from python-dateutil>=2.8.2->pandas>=1.4.0->geopandas) (1.16.0)
[notice] A new release of pip is available: 23.3.1 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
3.1.9 [2 marks]
Load the shapefile and display it.
Taxi Zones Data Info:
--------------------------------------------------
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OBJECTID 263 non-null int32
1 Shape_Leng 263 non-null float64
2 Shape_Area 263 non-null float64
3 zone 263 non-null object
4 LocationID 263 non-null int32
5 borough 263 non-null object
6 geometry 263 non-null geometry
dtypes: float64(2), geometry(1), int32(2), object(2)
memory usage: 12.5+ KB
None
First few rows of taxi zones data:
--------------------------------------------------
OBJECTID Shape_Leng Shape_Area zone LocationID \
0 1 0.116357 0.000782 Newark Airport 1
1 2 0.433470 0.004866 Jamaica Bay 2
2 3 0.084341 0.000314 Allerton/Pelham Gardens 3
3 4 0.043567 0.000112 Alphabet City 4
4 5 0.092146 0.000498 Arden Heights 5
borough geometry
0 EWR POLYGON ((933100.918 192536.086, 933091.011 19...
1 Queens MULTIPOLYGON (((1033269.244 172126.008, 103343...
2 Bronx POLYGON ((1026308.77 256767.698, 1026495.593 2...
3 Manhattan POLYGON ((992073.467 203714.076, 992068.667 20...
4 Staten Island POLYGON ((935843.31 144283.336, 936046.565 144...
Columns in the shapefile:
--------------------------------------------------
['OBJECTID', 'Shape_Leng', 'Shape_Area', 'zone', 'LocationID', 'borough', 'geometry']
Zone Statistics:
--------------------------------------------------
Total number of zones: 263
Borough distribution:
borough
Queens 69
Manhattan 69
Brooklyn 61
Bronx 43
Staten Island 20
EWR 1
Name: count, dtype: int64
| OBJECTID | Shape_Leng | Shape_Area | zone | LocationID | borough | geometry | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.116357 | 0.000782 | Newark Airport | 1 | EWR | POLYGON ((933100.918 192536.086, 933091.011 19... |
| 1 | 2 | 0.433470 | 0.004866 | Jamaica Bay | 2 | Queens | MULTIPOLYGON (((1033269.244 172126.008, 103343... |
| 2 | 3 | 0.084341 | 0.000314 | Allerton/Pelham Gardens | 3 | Bronx | POLYGON ((1026308.77 256767.698, 1026495.593 2... |
| 3 | 4 | 0.043567 | 0.000112 | Alphabet City | 4 | Manhattan | POLYGON ((992073.467 203714.076, 992068.667 20... |
| 4 | 5 | 0.092146 | 0.000498 | Arden Heights | 5 | Staten Island | POLYGON ((935843.31 144283.336, 936046.565 144... |
| 5 | 6 | 0.150491 | 0.000606 | Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON ((966568.747 158679.855, 966615.256 15... |
| 6 | 7 | 0.107417 | 0.000390 | Astoria | 7 | Queens | POLYGON ((1010804.218 218919.641, 1011049.165 ... |
| 7 | 8 | 0.027591 | 0.000027 | Astoria Park | 8 | Queens | POLYGON ((1005482.276 221686.466, 1005304.898 ... |
| 8 | 9 | 0.099784 | 0.000338 | Auburndale | 9 | Queens | POLYGON ((1043803.993 216615.925, 1043849.708 ... |
| 9 | 10 | 0.099839 | 0.000436 | Baisley Park | 10 | Queens | POLYGON ((1044355.072 190734.321, 1044612.122 ... |
| 10 | 11 | 0.079211 | 0.000265 | Bath Beach | 11 | Brooklyn | POLYGON ((983945.076 158978.174, 984160.158 15... |
| 11 | 12 | 0.036661 | 0.000042 | Battery Park | 12 | Manhattan | POLYGON ((979908.772 196066.565, 979980.852 19... |
| 12 | 13 | 0.050281 | 0.000149 | Battery Park City | 13 | Manhattan | POLYGON ((980801.31 201248.869, 980697.386 200... |
| 13 | 14 | 0.175214 | 0.001382 | Bay Ridge | 14 | Brooklyn | POLYGON ((974794.238 174019.157, 975135.579 17... |
| 14 | 15 | 0.144336 | 0.000925 | Bay Terrace/Fort Totten | 15 | Queens | POLYGON ((1045882.129 229577.493, 1045886.158 ... |
| 15 | 16 | 0.141292 | 0.000872 | Bayside | 16 | Queens | POLYGON ((1048344.039 223210.621, 1048409.575 ... |
| 16 | 17 | 0.093523 | 0.000323 | Bedford | 17 | Brooklyn | POLYGON ((1000036.904 194829.434, 1000276.454 ... |
| 17 | 18 | 0.069800 | 0.000149 | Bedford Park | 18 | Bronx | POLYGON ((1016019.237 254945.129, 1015798.748 ... |
| 18 | 19 | 0.101825 | 0.000547 | Bellerose | 19 | Queens | POLYGON ((1060888.899 212784.64, 1061115.169 2... |
| 19 | 20 | 0.051440 | 0.000135 | Belmont | 20 | Bronx | POLYGON ((1016371.834 254183.401, 1016392.376 ... |
Now, if you look at the DataFrame created, you will see columns like: OBJECTID,Shape_Leng, Shape_Area, zone, LocationID, borough, geometry.
Now, the locationID here is also what we are using to mark pickup and drop zones in the trip records.
The geometric parameters like shape length, shape area and geometry are used to plot the zones on a map.
This can be easily done using the plot() method.
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 263 entries, 0 to 262 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OBJECTID 263 non-null int32 1 Shape_Leng 263 non-null float64 2 Shape_Area 263 non-null float64 3 zone 263 non-null object 4 LocationID 263 non-null int32 5 borough 263 non-null object 6 geometry 263 non-null geometry dtypes: float64(2), geometry(1), int32(2), object(2) memory usage: 12.5+ KB None
<Axes: >
Now, you have to merge the trip records and zones data using the location IDs.
3.1.10 [3 marks]
Merge the zones data into trip data using the locationID and PULocationID columns.
Merged Dataset Info: -------------------------------------------------- <class 'pandas.core.frame.DataFrame'> RangeIndex: 265416 entries, 0 to 265415 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 VendorID 265416 non-null int64 1 tpep_pickup_datetime 265416 non-null datetime64[us] 2 tpep_dropoff_datetime 265416 non-null datetime64[us] 3 passenger_count 265416 non-null int64 4 trip_distance 265416 non-null float64 5 RatecodeID 265416 non-null int64 6 PULocationID 265416 non-null int64 7 DOLocationID 265416 non-null int64 8 payment_type 265416 non-null int64 9 fare_amount 265416 non-null float64 10 extra 265416 non-null float64 11 mta_tax 265416 non-null float64 12 tip_amount 265416 non-null float64 13 tolls_amount 265416 non-null float64 14 improvement_surcharge 265416 non-null float64 15 total_amount 265416 non-null float64 16 congestion_surcharge 265416 non-null float64 17 airport_fee 265416 non-null float64 18 is_refund 265416 non-null bool 19 fare_amount_scaled 265416 non-null float64 20 total_amount_scaled 265416 non-null float64 21 hour 265416 non-null int32 22 day_of_week 265416 non-null object 23 month 265416 non-null object 24 quarter 265416 non-null int32 25 trip_duration 265416 non-null float64 26 pickup_zone 262760 non-null object 27 pickup_borough 262760 non-null object 28 geometry 262760 non-null geometry dtypes: bool(1), datetime64[us](2), float64(13), geometry(1), int32(2), int64(6), object(4) memory usage: 54.9+ MB None Merge Statistics: -------------------------------------------------- Original trips: 265395 Merged trips: 265416 Trips with missing zone info: 2656 Sample of Merged Data: -------------------------------------------------- PULocationID pickup_zone pickup_borough fare_amount 0 138 LaGuardia Airport Queens 32.40 1 161 Midtown Center Manhattan 7.90 2 237 Upper East Side South Manhattan 11.40 3 143 Lincoln Square West Manhattan 6.50 4 66 DUMBO/Vinegar Hill Brooklyn 25.64 Pickups by Borough: -------------------------------------------------- pickup_borough Manhattan 234564 Queens 25776 Brooklyn 1919 Bronx 444 EWR 37 Staten Island 20 Name: count, dtype: int64
Final Dataset Info:
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265501 entries, 0 to 265500
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 VendorID 265501 non-null int64
1 tpep_pickup_datetime 265501 non-null datetime64[us]
2 tpep_dropoff_datetime 265501 non-null datetime64[us]
3 passenger_count 265501 non-null int64
4 trip_distance 265501 non-null float64
5 RatecodeID 265501 non-null int64
6 PULocationID 265501 non-null int64
7 DOLocationID 265501 non-null int64
8 payment_type 265501 non-null int64
9 fare_amount 265501 non-null float64
10 extra 265501 non-null float64
11 mta_tax 265501 non-null float64
12 tip_amount 265501 non-null float64
13 tolls_amount 265501 non-null float64
14 improvement_surcharge 265501 non-null float64
15 total_amount 265501 non-null float64
16 congestion_surcharge 265501 non-null float64
17 airport_fee 265501 non-null float64
18 is_refund 265501 non-null bool
19 fare_amount_scaled 265501 non-null float64
20 total_amount_scaled 265501 non-null float64
21 hour 265501 non-null int32
22 day_of_week 265501 non-null object
23 month 265501 non-null object
24 quarter 265501 non-null int32
25 trip_duration 265501 non-null float64
26 pickup_zone 262845 non-null object
27 pickup_borough 262845 non-null object
28 geometry 262845 non-null geometry
29 dropoff_zone 261680 non-null object
30 dropoff_borough 261680 non-null object
dtypes: bool(1), datetime64[us](2), float64(13), geometry(1), int32(2), int64(6), object(6)
memory usage: 59.0+ MB
None
Top 10 Pickup-Dropoff Borough Patterns:
--------------------------------------------------
pickup_borough dropoff_borough
Manhattan Manhattan 218570
Queens Manhattan 14720
Manhattan Queens 8088
Queens Queens 5906
Manhattan Brooklyn 5552
Queens Brooklyn 3806
Brooklyn Brooklyn 964
Manhattan Bronx 739
EWR 701
Brooklyn Manhattan 667
dtype: int64
Trips within same borough:
--------------------------------------------------
Number of trips: 225702
Percentage: 85.01%
Top Borough-to-Borough Patterns: -------------------------------------------------- Manhattan → Manhattan: 218,570 trips (84.16%) Queens → Manhattan: 14,720 trips (5.67%) Manhattan → Queens: 8,088 trips (3.11%) Queens → Queens: 5,906 trips (2.27%) Manhattan → Brooklyn: 5,552 trips (2.14%) Queens → Brooklyn: 3,806 trips (1.47%) Brooklyn → Brooklyn: 964 trips (0.37%) Manhattan → Bronx: 739 trips (0.28%) Manhattan → EWR: 701 trips (0.27%) Brooklyn → Manhattan: 667 trips (0.26%) Manhattan Pattern Analysis: -------------------------------------------------- Trips starting in Manhattan: 234,592 (88.36%) Trips ending in Manhattan: 234,492 (88.32%) Inter-borough Trip Patterns by Hour: -------------------------------------------------- hour 14 2530 15 2524 16 2420 17 2338 22 2328 Name: inter_borough, dtype: int64
3.1.11 [3 marks]
Group data by location IDs to find the total number of trips per location ID
Top 15 Pickup Locations:
----------------------------------------------------------------------
PULocationID pickup_zone pickup_borough pickup_count
132 JFK Airport Queens 13638
237 Upper East Side South Manhattan 12341
161 Midtown Center Manhattan 12269
236 Upper East Side North Manhattan 11083
162 Midtown East Manhattan 9425
138 LaGuardia Airport Queens 9018
186 Penn Station/Madison Sq West Manhattan 8871
230 Times Sq/Theatre District Manhattan 8741
142 Lincoln Square East Manhattan 8628
170 Murray Hill Manhattan 7790
163 Midtown North Manhattan 7580
239 Upper West Side South Manhattan 7412
234 Union Sq Manhattan 7130
48 Clinton East Manhattan 7098
68 East Chelsea Manhattan 6962
Top 15 Dropoff Locations:
----------------------------------------------------------------------
DOLocationID dropoff_zone dropoff_borough dropoff_count
236 Upper East Side North Manhattan 11785
237 Upper East Side South Manhattan 11005
161 Midtown Center Manhattan 10291
230 Times Sq/Theatre District Manhattan 8029
170 Murray Hill Manhattan 7836
162 Midtown East Manhattan 7500
142 Lincoln Square East Manhattan 7464
239 Upper West Side South Manhattan 7364
141 Lenox Hill West Manhattan 6834
68 East Chelsea Manhattan 6787
48 Clinton East Manhattan 6589
163 Midtown North Manhattan 6507
234 Union Sq Manhattan 6057
186 Penn Station/Madison Sq West Manhattan 5952
238 Upper West Side North Manhattan 5946
Location Summary Statistics:
----------------------------------------------------------------------
Total unique pickup locations: 238
Total unique dropoff locations: 252
Unused Locations:
----------------------------------------------------------------------
Locations with no pickups: 22
Locations with no dropoffs: 8
3.1.12 [2 marks]
Now, use the grouped data to add number of trips to the GeoDataFrame.
We will use this to plot a map of zones showing total trips per zone.
Zone Trip Statistics:
----------------------------------------------------------------------
Pickup Statistics:
count 263.000000
mean 999.577947
std 2444.154158
min 0.000000
25% 5.000000
50% 18.000000
75% 140.500000
max 13638.000000
Name: pickup_count, dtype: float64
Dropoff Statistics:
count 263.000000
mean 995.627376
std 2090.821382
min 0.000000
25% 26.500000
50% 73.000000
75% 605.000000
max 11785.000000
Name: dropoff_count, dtype: float64
Total Trips Statistics:
count 263.000000
mean 1995.205323
std 4467.543930
min 0.000000
25% 35.500000
50% 85.000000
75% 746.000000
max 23346.000000
Name: total_trips, dtype: float64
Top 10 Zones by Total Trip Volume:
----------------------------------------------------------------------
zone borough pickup_count dropoff_count \
236 Upper East Side South Manhattan 12341.0 11005.0
235 Upper East Side North Manhattan 11083.0 11785.0
160 Midtown Center Manhattan 12269.0 10291.0
161 Midtown East Manhattan 9425.0 7500.0
229 Times Sq/Theatre District Manhattan 8741.0 8029.0
131 JFK Airport Queens 13638.0 3123.0
141 Lincoln Square East Manhattan 8628.0 7464.0
169 Murray Hill Manhattan 7790.0 7836.0
185 Penn Station/Madison Sq West Manhattan 8871.0 5952.0
238 Upper West Side South Manhattan 7412.0 7364.0
total_trips
236 23346.0
235 22868.0
160 22560.0
161 16925.0
229 16770.0
131 16761.0
141 16092.0
169 15626.0
185 14823.0
238 14776.0
Merge Verification:
----------------------------------------------------------------------
Total zones: 263
Zones with trips: 253
Zones with no trips: 10
The next step is creating a color map (choropleth map) showing zones by the number of trips taken.
Again, you can use the zones.plot() method for this. Plot Method GPD
But first, you need to define the figure and axis for the plot.
fig, ax = plt.subplots(1, 1, figsize = (12, 10))
This function creates a figure (fig) and a single subplot (ax)
After setting up the figure and axis, we can proceed to plot the GeoDataFrame on this axis. This is done in the next step where we use the plot method of the GeoDataFrame.
You can define the following parameters in the zones.plot() method:
column = '',
ax = ax,
legend = True,
legend_kwds = {'label': "label", 'orientation': "<horizontal/vertical>"}
To display the plot, use plt.show().
3.1.13 [3 marks]
Plot a color-coded map showing zone-wise trips
Zones Sorted by Total Trip Volume: ---------------------------------------------------------------------------------------------------- # Zone Borough Pickups Dropoffs Total ---------------------------------------------------------------------------------------------------- 0. Upper East Side South Manhattan 12,341 11,005 23,346 1. Upper East Side North Manhattan 11,083 11,785 22,868 2. Midtown Center Manhattan 12,269 10,291 22,560 3. Midtown East Manhattan 9,425 7,500 16,925 4. Times Sq/Theatre District Manhattan 8,741 8,029 16,770 5. JFK Airport Queens 13,638 3,123 16,761 6. Lincoln Square East Manhattan 8,628 7,464 16,092 7. Murray Hill Manhattan 7,790 7,836 15,626 8. Penn Station/Madison Sq West Manhattan 8,871 5,952 14,823 9. Upper West Side South Manhattan 7,412 7,364 14,776 10. Midtown North Manhattan 7,580 6,507 14,087 11. East Chelsea Manhattan 6,962 6,787 13,749 12. Clinton East Manhattan 7,098 6,589 13,687 13. Union Sq Manhattan 7,130 6,057 13,187 14. Lenox Hill West Manhattan 6,351 6,834 13,185 15. LaGuardia Airport Queens 9,018 3,411 12,429 16. Midtown South Manhattan 6,143 5,702 11,845 17. East Village Manhattan 6,361 5,482 11,843 18. Upper West Side North Manhattan 5,161 5,946 11,107 19. Gramercy Manhattan 5,622 5,207 10,829 20. Yorkville West Manhattan 5,086 5,567 10,653 21. Sutton Place/Turtle Bay North Manhattan 5,072 5,549 10,621 22. Lenox Hill East Manhattan 5,186 5,391 10,577 23. West Village Manhattan 5,953 4,334 10,287 24. West Chelsea/Hudson Yards Manhattan 4,452 5,102 9,554 25. Flatiron Manhattan 4,348 3,774 8,122 26. Garment District Manhattan 4,290 3,608 7,898 27. Central Park Manhattan 4,465 3,236 7,701 28. Yorkville East Manhattan 3,523 4,061 7,584 29. TriBeCa/Civic Center Manhattan 3,600 3,970 7,570 30. UN/Turtle Bay South Manhattan 3,291 4,011 7,302 31. Greenwich Village North Manhattan 3,651 3,489 7,140 32. Lincoln Square West Manhattan 2,924 3,856 6,780 33. Kips Bay Manhattan 3,011 3,282 6,293 34. Greenwich Village South Manhattan 3,475 2,590 6,065 35. Lower East Side Manhattan 2,729 2,725 5,454 36. Little Italy/NoLiTa Manhattan 2,417 2,486 4,903 37. Meatpacking/West Village West Manhattan 2,289 2,614 4,903 38. East Harlem South Manhattan 1,594 3,075 4,669 39. Manhattan Valley Manhattan 2,230 2,404 4,634 40. Clinton West Manhattan 1,709 2,885 4,594 41. SoHo Manhattan 2,018 2,157 4,175 42. Battery Park City Manhattan 1,556 1,932 3,488 43. Financial District North Manhattan 1,353 2,013 3,366 44. Morningside Heights Manhattan 1,258 1,861 3,119 45. World Trade Center Manhattan 1,409 1,234 2,643 46. Hudson Sq Manhattan 1,265 1,317 2,582 47. East Harlem North Manhattan 558 1,831 2,389 48. Central Harlem Manhattan 614 1,617 2,231 49. Financial District South Manhattan 710 972 1,682 50. Two Bridges/Seward Park Manhattan 366 1,257 1,623 51. Bloomingdale Manhattan 693 849 1,542 52. Stuy Town/Peter Cooper Village Manhattan 348 1,059 1,407 53. East Elmhurst Queens 1,192 152 1,344 54. Alphabet City Manhattan 309 960 1,269 55. Seaport Manhattan 501 721 1,222 56. Central Harlem North Manhattan 234 943 1,177 57. Chinatown Manhattan 384 670 1,054 58. Long Island City/Hunters Point Queens 241 753 994 59. Washington Heights South Manhattan 179 793 972 60. Astoria Queens 138 745 883 61. Newark Airport EWR 37 794 831 62. Park Slope Brooklyn 102 718 820 63. Hamilton Heights Manhattan 204 614 818 64. Brooklyn Heights Brooklyn 143 670 813 65. Williamsburg (North Side) Brooklyn 101 702 803 66. Greenpoint Brooklyn 50 639 689 67. Williamsburg (South Side) Brooklyn 78 596 674 68. Sunnyside Queens 191 446 637 69. Downtown Brooklyn/MetroTech Brooklyn 164 417 581 70. Boerum Hill Brooklyn 109 464 573 71. Manhattanville Manhattan 162 343 505 72. Washington Heights North Manhattan 50 452 502 73. Crown Heights North Brooklyn 74 418 492 74. Fort Greene Brooklyn 94 372 466 75. Queensbridge/Ravenswood Queens 183 250 433 76. Jackson Heights Queens 66 365 431 77. Steinway Queens 59 372 431 78. Clinton Hill Brooklyn 45 376 421 79. Long Island City/Queens Plaza Queens 137 265 402 80. East Williamsburg Brooklyn 71 329 400 81. Forest Hills Queens 44 354 398 82. DUMBO/Vinegar Hill Brooklyn 78 320 398 83. Baisley Park Queens 81 252 333 84. Bedford Brooklyn 40 288 328 85. Bushwick South Brooklyn 29 281 310 86. South Ozone Park Queens 44 252 296 87. Old Astoria Queens 47 247 294 88. Stuyvesant Heights Brooklyn 27 246 273 89. Woodside Queens 51 218 269 90. Flatbush/Ditmas Park Brooklyn 44 223 267 91. Battery Park Manhattan 112 152 264 92. Prospect-Lefferts Gardens Brooklyn 51 202 253 93. Prospect Heights Brooklyn 18 228 246 94. Carroll Gardens Brooklyn 26 210 236 95. Cobble Hill Brooklyn 60 159 219 96. Bushwick North Brooklyn 23 192 215 97. Corona Queens 44 170 214 98. Corona Queens 44 170 214 99. Mott Haven/Port Morris Bronx 49 157 206 100. Flushing Queens 19 182 201 101. Jamaica Queens 46 154 200 102. Bay Ridge Brooklyn 19 178 197 103. Elmhurst Queens 37 159 196 104. Inwood Manhattan 20 173 193 105. Roosevelt Island Manhattan 13 151 164 106. Flushing Meadows-Corona Park Queens 58 104 162 107. Richmond Hill Queens 36 123 159 108. Crown Heights South Brooklyn 26 133 159 109. Springfield Gardens South Queens 32 118 150 110. Sunset Park West Brooklyn 16 134 150 111. Ridgewood Queens 9 137 146 112. East New York Brooklyn 34 110 144 113. Spuyten Duyvil/Kingsbridge Bronx 21 111 132 114. Briarwood/Jamaica Hills Queens 21 106 127 115. Windsor Terrace Brooklyn 6 118 124 116. South Williamsburg Brooklyn 9 113 122 117. Red Hook Brooklyn 25 96 121 118. Saint Albans Queens 22 99 121 119. South Jamaica Queens 47 73 120 120. Canarsie Brooklyn 29 87 116 121. West Concourse Bronx 27 89 116 122. Kew Gardens Queens 23 86 109 123. Rego Park Queens 9 99 108 124. Gowanus Brooklyn 17 89 106 125. Maspeth Queens 23 82 105 126. East Concourse/Concourse Village Bronx 18 84 102 127. Springfield Gardens North Queens 26 75 101 128. Ocean Hill Brooklyn 20 74 94 129. Riverdale/North Riverdale/Fieldston Bronx 7 87 94 130. Melrose South Bronx 18 71 89 131. Kew Gardens Hills Queens 10 75 85 132. East Flatbush/Remsen Village Brooklyn 20 64 84 133. Brownsville Brooklyn 28 55 83 134. Kensington Brooklyn 8 75 83 135. Middle Village Queens 9 73 82 136. Queens Village Queens 20 62 82 137. Flatlands Brooklyn 22 60 82 138. East Flatbush/Farragut Brooklyn 19 60 79 139. University Heights/Morris Heights Bronx 21 57 78 140. Midwood Brooklyn 12 63 75 141. Marine Park/Mill Basin Brooklyn 9 66 75 142. Hillcrest/Pomonok Queens 11 62 73 143. Jamaica Estates Queens 12 60 72 144. Borough Park Brooklyn 8 64 72 145. North Corona Queens 17 53 70 146. Co-Op City Bronx 18 51 69 147. Elmhurst/Maspeth Queens 9 60 69 148. Erasmus Brooklyn 14 55 69 149. Sheepshead Bay Brooklyn 16 51 67 150. Cypress Hills Brooklyn 6 59 65 151. Howard Beach Queens 5 60 65 152. Cambria Heights Queens 13 50 63 153. Soundview/Castle Hill Bronx 23 40 63 154. Bensonhurst West Brooklyn 9 52 61 155. Williamsbridge/Olinville Bronx 12 49 61 156. Mount Hope Bronx 15 46 61 157. Coney Island Brooklyn 16 44 60 158. Glendale Queens 3 56 59 159. Woodhaven Queens 8 51 59 160. Homecrest Brooklyn 10 49 59 161. Bayside Queens 3 54 57 162. Whitestone Queens 2 55 57 163. Far Rockaway Queens 14 43 57 164. Saint Michaels Cemetery/Woodside Queens 27 29 56 165. Columbia Street Brooklyn 6 49 55 166. Morrisania/Melrose Bronx 12 43 55 167. Fresh Meadows Queens 5 46 51 168. Laurelton Queens 5 46 51 169. Randalls Island Manhattan 10 41 51 170. Rosedale Queens 10 40 50 171. Highbridge Bronx 8 42 50 172. Murray Hill-Queens Queens 3 45 48 173. Van Nest/Morris Park Bronx 16 31 47 174. Norwood Bronx 8 38 46 175. Van Cortlandt Village Bronx 4 42 46 176. College Point Queens 10 35 45 177. Sunset Park East Brooklyn 4 40 44 178. Parkchester Bronx 18 25 43 179. Schuylerville/Edgewater Park Bronx 6 37 43 180. Bellerose Queens 5 36 41 181. Brooklyn Navy Yard Brooklyn 5 36 41 182. Bedford Park Bronx 9 31 40 183. Hammels/Arverne Queens 9 31 40 184. East New York/Pennsylvania Avenue Brooklyn 16 24 40 185. Brighton Beach Brooklyn 4 36 40 186. Pelham Parkway Bronx 8 31 39 187. Prospect Park Brooklyn 7 32 39 188. Madison Brooklyn 10 29 39 189. Claremont/Bathgate Bronx 11 27 38 190. Belmont Bronx 4 34 38 191. East Tremont Bronx 14 24 38 192. Westchester Village/Unionport Bronx 8 30 38 193. Kingsbridge Heights Bronx 11 27 38 194. Starrett City Brooklyn 10 26 36 195. Woodlawn/Wakefield Bronx 6 30 36 196. Soundview/Bruckner Bronx 6 30 36 197. Hunts Point Bronx 8 27 35 198. Dyker Heights Brooklyn 4 30 34 199. Douglaston Queens 2 30 32 200. West Farms/Bronx River Bronx 6 25 31 201. Bensonhurst East Brooklyn 10 21 31 202. Ozone Park Queens 3 27 30 203. Oakland Gardens Queens 6 24 30 204. Crotona Park East Bronx 11 18 29 205. Bay Terrace/Fort Totten Queens 3 26 29 206. Bronxdale Bronx 7 22 29 207. Eastchester Bronx 9 20 29 208. Hollis Queens 13 15 28 209. Longwood Bronx 9 18 27 210. Bath Beach Brooklyn 5 22 27 211. Queensboro Hill Queens 2 24 26 212. Rockaway Park Queens 3 22 25 213. Gravesend Brooklyn 7 18 25 214. Allerton/Pelham Gardens Bronx 9 14 23 215. Glen Oaks Queens 1 21 22 216. Auburndale Queens 3 19 22 217. Ocean Parkway South Brooklyn 1 20 21 218. East Flushing Queens 2 19 21 219. Fordham South Bronx 4 17 21 220. Manhattan Beach Brooklyn 5 15 20 221. Bronx Park Bronx 1 18 19 222. Highbridge Park Manhattan 0 13 13 223. Bloomfield/Emerson Hill Staten Island 6 7 13 224. Inwood Hill Park Manhattan 1 12 13 225. Heartland Village/Todt Hill Staten Island 1 11 12 226. Pelham Bay Bronx 0 12 12 227. Marble Hill Manhattan 2 9 11 228. City Island Bronx 1 8 9 229. Country Club Bronx 0 9 9 230. Forest Park/Highland Park Queens 2 6 8 231. Van Cortlandt Park Bronx 1 7 8 232. Arrochar/Fort Wadsworth Staten Island 3 4 7 233. Charleston/Tottenville Staten Island 2 5 7 234. West Brighton Staten Island 1 6 7 235. Saint George/New Brighton Staten Island 0 6 6 236. Great Kills Staten Island 2 4 6 237. Marine Park/Floyd Bennett Field Brooklyn 1 5 6 238. Arden Heights Staten Island 1 5 6 239. South Beach/Dongan Hills Staten Island 1 4 5 240. Willets Point Queens 0 5 5 241. Astoria Park Queens 0 4 4 242. Stapleton Staten Island 1 3 4 243. Westerleigh Staten Island 1 3 4 244. Breezy Point/Fort Tilden/Riis Beach Queens 0 4 4 245. Rossville/Woodrow Staten Island 0 3 3 246. Broad Channel Queens 0 3 3 247. Green-Wood Cemetery Brooklyn 0 3 3 248. Pelham Bay Park Bronx 0 2 2 249. Port Richmond Staten Island 1 1 2 250. Crotona Park Bronx 0 1 1 251. Mariners Harbor Staten Island 0 1 1 252. Freshkills Park Staten Island 0 1 1 253. New Dorp/Midland Beach Staten Island 0 0 0 254. Great Kills Park Staten Island 0 0 0 255. Grymes Hill/Clifton Staten Island 0 0 0 256. Rikers Island Bronx 0 0 0 257. Governor's Island/Ellis Island/Liberty Island Manhattan 0 0 0 258. Jamaica Bay Queens 0 0 0 259. Oakwood Staten Island 0 0 0 260. Governor's Island/Ellis Island/Liberty Island Manhattan 0 0 0 261. Eltingville/Annadale/Prince's Bay Staten Island 0 0 0 262. Governor's Island/Ellis Island/Liberty Island Manhattan 0 0 0 Summary Statistics: -------------------------------------------------- Total Zones: 263 Active Zones (with trips): 253 Inactive Zones: 10 Total Trips: 524,739.0 Average Trips per Zone: 1,995.2 Median Trips per Zone: 85.0
Here we have completed the temporal, financial and geographical analysis on the trip records.
Compile your findings from general analysis below:
You can consider the following points:
- Busiest hours, days and months
- Trends in revenue collected
- Trends in quarterly revenue
- How fare depends on trip distance, trip duration and passenger counts
- How tip amount depends on trip distance
- Busiest zones
3.2 Detailed EDA: Insights and Strategies¶
[50 marks]
Having performed basic analyses for finding trends and patterns, we will now move on to some detailed analysis focussed on operational efficiency, pricing strategies, and customer experience.
Operational Efficiency¶
Analyze variations by time of day and location to identify bottlenecks or inefficiencies in routes
3.2.1 [3 marks]
Identify slow routes by calculating the average time taken by cabs to get from one zone to another at different hours of the day.
Speed on a route X for hour Y = (distance of the route X / average trip duration for hour Y)
How does identifying high-traffic, high-demand routes help us?
3.2.2 [3 marks]
Calculate the number of trips at each hour of the day and visualise them. Find the busiest hour and show the number of trips for that hour.
Hourly Trip Analysis: -------------------------------------------------- Busiest Hour: 18:00 with 18,751 trips Top 5 Busiest Hours: hour trip_count 18 18751 17 17912 19 16744 16 16457 15 16436 Peak Hours (7-9 AM, 4-6 PM) account for 30.8% of daily trips
Remember, we took a fraction of trips. To find the actual number, you have to scale the number up by the sampling ratio.
3.2.3 [2 mark]
Find the actual number of trips in the five busiest hours
Actual Trip Estimates for Busiest Hours: ------------------------------------------------------------ Hour Sampled Trips Actual Trips ------------------------------------------------------------ 18:00 18,751 1,875,100 17:00 17,912 1,791,200 19:00 16,744 1,674,400 16:00 16,457 1,645,700 15:00 16,436 1,643,600
3.2.4 [3 marks]
Compare hourly traffic pattern on weekdays. Also compare for weekend.
Hourly Pattern Analysis: ------------------------------------------------------------ Weekday Peak Hours: hour 18 2851.0 17 2680.0 19 2512.2 Name: Weekday, dtype: float64 Weekend Peak Hours: hour 17 2256.0 18 2248.0 16 2215.0 Name: Weekend, dtype: float64 Weekday vs Weekend Comparison: Average weekday trips per hour: 1611 Average weekend trips per hour: 1505 Weekday trips are 7.0% higher than weekend trips
What can you infer from the above patterns? How will finding busy and quiet hours for each day help us?
3.2.5 [3 marks]
Identify top 10 zones with high hourly pickups. Do the same for hourly dropoffs. Show pickup and dropoff trends in these zones.
Top 10 Pickup Zones: ------------------------------------------------------------ JFK Airport 13,638 pickups Upper East Side South 12,341 pickups Midtown Center 12,269 pickups Upper East Side North 11,083 pickups Midtown East 9,425 pickups LaGuardia Airport 9,018 pickups Penn Station/Madison Sq West 8,871 pickups Times Sq/Theatre District 8,741 pickups Lincoln Square East 8,628 pickups Murray Hill 7,790 pickups Top 10 Dropoff Zones: ------------------------------------------------------------ Upper East Side North 11,785 dropoffs Upper East Side South 11,005 dropoffs Midtown Center 10,291 dropoffs Times Sq/Theatre District 8,029 dropoffs Murray Hill 7,836 dropoffs Midtown East 7,500 dropoffs Lincoln Square East 7,464 dropoffs Upper West Side South 7,364 dropoffs Lenox Hill West 6,834 dropoffs East Chelsea 6,787 dropoffs Peak Hours Analysis for Top Zones: ------------------------------------------------------------ JFK Airport: Peak hour: 15:00 with 964 pickups Upper East Side South: Peak hour: 17:00 with 950 pickups Midtown Center: Peak hour: 18:00 with 1080 pickups
3.2.6 [3 marks]
Find the ratio of pickups and dropoffs in each zone. Display the 10 highest (pickup/drop) and 10 lowest (pickup/drop) ratios.
Zones with Highest Pickup/Dropoff Ratios: -------------------------------------------------------------------------------- Zone Ratio Pickups Dropoffs Total Trips -------------------------------------------------------------------------------- East Elmhurst 7.84 1192 152 1344 JFK Airport 4.37 13638 3123 16761 LaGuardia Airport 2.64 9018 3411 12429 Penn Station/Madison Sq West 1.49 8871 5952 14823 Central Park 1.38 4465 3236 7701 West Village 1.37 5953 4334 10287 Greenwich Village South 1.34 3475 2590 6065 Midtown East 1.26 9425 7500 16925 Midtown Center 1.19 12269 10291 22560 Garment District 1.19 4290 3608 7898 Zones with Lowest Pickup/Dropoff Ratios: -------------------------------------------------------------------------------- Zone Ratio Pickups Dropoffs Total Trips -------------------------------------------------------------------------------- Newark Airport 0.05 37 794 831 Windsor Terrace 0.05 6 118 124 Ridgewood 0.07 9 137 146 Greenpoint 0.08 50 639 689 Prospect Heights 0.08 18 228 246 South Williamsburg 0.08 9 113 122 Roosevelt Island 0.09 13 151 164 Rego Park 0.09 9 99 108 Bushwick South 0.10 29 281 310 Flushing 0.10 19 182 201
3.2.7 [3 marks]
Identify zones with high pickup and dropoff traffic during night hours (11PM to 5AM)
Top Night Activity Zones (11 PM - 5 AM): -------------------------------------------------------------------------------- Zone Pickups Dropoffs Total -------------------------------------------------------------------------------- East Village 2313 1232 3545 West Village 1877 692 2569 Clinton East 1495 1036 2531 JFK Airport 2043 311 2354 Lower East Side 1448 664 2112 Times Sq/Theatre District 1246 640 1886 East Chelsea 880 875 1755 Gramercy 861 853 1714 Greenwich Village South 1234 367 1601 Penn Station/Madison Sq West 997 570 1567 Murray Hill 590 906 1496 Midtown South 906 589 1495 Flatiron 721 668 1389 Union Sq 806 526 1332 Yorkville West 537 772 1309 Key Insights: -------------------------------------------------------------------------------- Total night trips: 32,069 Percentage of all trips that occur at night: 12.1%
Now, let us find the revenue share for the night time hours and the day time hours. After this, we will move to deciding a pricing strategy.
3.2.8 [2 marks]
Find the revenue share for nighttime and daytime hours.
Revenue Analysis (Day vs Night): ------------------------------------------------------------ Day Revenue (6AM-10PM): $6,723,403.11 Night Revenue (11PM-5AM): $950,864.45 Total Revenue: $7,674,267.56 Average Revenue per Trip: ------------------------------------------------------------ Day: $28.80 per trip Night: $29.65 per trip Average Revenue per Hour: ------------------------------------------------------------ Day: $13183.14 per hour Night: $4527.93 per hour
Pricing Strategy¶
3.2.9 [2 marks]
For the different passenger counts, find the average fare per mile per passenger.
For instance, suppose the average fare per mile for trips with 3 passengers is 3 USD/mile, then the fare per mile per passenger will be 1 USD/mile.
Detailed Fare Analysis by Passenger Count:
--------------------------------------------------------------------------------
Passengers Avg Fare/Mile/Passenger Avg Distance Avg Fare Number of Trips
--------------------------------------------------------------------------------
1 $ 7.87 3.39 $ 19.13 206,305
2 $ 3.87 4.17 $ 22.17 38,676
3 $ 2.61 3.98 $ 21.91 9,642
4 $ 1.99 4.25 $ 23.16 5,342
5 $ 1.53 3.36 $ 18.94 3,302
6 $ 1.28 3.34 $ 18.84 2,234
Trip Distribution by Passenger Count:
--------------------------------------------------
1 passenger(s): 77.7%
2 passenger(s): 14.57%
3 passenger(s): 3.63%
4 passenger(s): 2.01%
5 passenger(s): 1.24%
6 passenger(s): 0.84%
3.2.10 [3 marks]
Find the average fare per mile by hours of the day and by days of the week
Hourly Fare Analysis: ------------------------------------------------------------ Hour Avg Fare/Mile Peak Hours ------------------------------------------------------------ 00:00 $ 6.67 01:00 $ 6.67 02:00 $ 6.62 03:00 $ 6.46 04:00 $ 5.97 05:00 $ 5.98 06:00 $ 6.30 07:00 $ 7.06 Morning Peak 08:00 $ 7.72 Morning Peak 09:00 $ 7.98 Morning Peak 10:00 $ 8.19 Morning Peak 11:00 $ 8.52 12:00 $ 8.62 13:00 $ 8.50 14:00 $ 8.47 15:00 $ 8.46 16:00 $ 8.47 Evening Peak 17:00 $ 8.42 Evening Peak 18:00 $ 8.23 Evening Peak 19:00 $ 7.92 Evening Peak 20:00 $ 7.28 21:00 $ 7.10 22:00 $ 6.95 23:00 $ 6.68 Daily Fare Analysis: ------------------------------------------------------------ Day Avg Fare/Mile % Difference from Weekday Avg ------------------------------------------------------------ Friday $ 7.84 -1.7% Monday $ 7.52 -5.7% Saturday $ 7.74 -2.9% Sunday $ 7.16 -10.2% Thursday $ 8.19 +2.8% Tuesday $ 8.11 +1.7% Wednesday $ 8.19 +2.8%
Work Week Average - Week Average ------> 0.14947186227420506
3.2.11 [3 marks]
Analyse the average fare per mile for the different vendors for different hours of the day
Vendor Comparison by Hour: -------------------------------------------------------------------------------- Hour Vendor 1 Vendor 2 Difference Trip Count (V1/V2) -------------------------------------------------------------------------------- 00:00 $ 6.48 $ 6.72 $ -0.25 1774/5733 01:00 $ 6.44 $ 6.73 $ -0.29 1133/3914 02:00 $ 6.58 $ 6.64 $ -0.06 799/2521 03:00 $ 6.26 $ 6.52 $ -0.27 529/1667 04:00 $ 5.89 $ 5.99 $ -0.10 386/1118 05:00 $ 6.06 $ 5.96 $ 0.10 468/1081 06:00 $ 6.31 $ 6.30 $ 0.01 1091/2574 07:00 $ 7.03 $ 7.07 $ -0.03 2219/5032 08:00 $ 7.61 $ 7.77 $ -0.17 2924/7131 09:00 $ 8.00 $ 7.97 $ 0.03 3204/8135 10:00 $ 8.04 $ 8.26 $ -0.22 3572/8731 11:00 $ 8.36 $ 8.59 $ -0.23 3815/9540 12:00 $ 8.60 $ 8.63 $ -0.03 4009/10492 13:00 $ 8.29 $ 8.59 $ -0.30 4313/10631 14:00 $ 8.50 $ 8.46 $ 0.04 4409/11602 15:00 $ 8.44 $ 8.47 $ -0.04 4464/11967 16:00 $ 8.39 $ 8.50 $ -0.11 4543/11911 17:00 $ 8.33 $ 8.46 $ -0.13 4925/12981 18:00 $ 8.21 $ 8.24 $ -0.03 4954/13793 19:00 $ 7.83 $ 7.95 $ -0.12 4246/12496 20:00 $ 7.22 $ 7.29 $ -0.07 3736/11193 21:00 $ 7.01 $ 7.13 $ -0.11 3652/11269 22:00 $ 6.82 $ 6.99 $ -0.17 3256/10565 23:00 $ 6.57 $ 6.71 $ -0.13 2588/8351 Overall Vendor Statistics: -------------------------------------------------------------------------------- Vendor 1: Average fare per mile: $7.80 Total trips: 71,009 Peak hour average: $7.74 Off-peak average: $7.81 Vendor 2: Average fare per mile: $7.86 Total trips: 194,428 Peak hour average: $7.85 Off-peak average: $7.87
3.2.12 [5 marks]
Compare the fare rates of the different vendors in a tiered fashion. Analyse the average fare per mile for distances upto 2 miles. Analyse the fare per mile for distances from 2 to 5 miles. And then for distances more than 5 miles.
Detailed Analysis by Distance Tier: -------------------------------------------------------------------------------- 0-2 miles Analysis: ---------------------------------------- Vendor 1: Average fare per mile: $9.49 Number of trips: 40,081.0 Average distance: 1.16 miles Average fare: $9.96 Vendor 2: Average fare per mile: $9.72 Number of trips: 104,307.0 Average distance: 1.16 miles Average fare: $10.16 2-5 miles Analysis: ---------------------------------------- Vendor 1: Average fare per mile: $6.28 Number of trips: 19,446.0 Average distance: 3.07 miles Average fare: $18.89 Vendor 2: Average fare per mile: $6.52 Number of trips: 54,393.0 Average distance: 3.02 miles Average fare: $19.25 5+ miles Analysis: ---------------------------------------- Vendor 1: Average fare per mile: $4.46 Number of trips: 11,482.0 Average distance: 11.17 miles Average fare: $47.76 Vendor 2: Average fare per mile: $4.50 Number of trips: 35,728.0 Average distance: 11.80 miles Average fare: $50.89 Vendor Price Comparison by Tier: -------------------------------------------------------------------------------- 0-2 miles: Vendor 1: $9.49/mile Vendor 2: $9.72/mile Difference: +2.4% 2-5 miles: Vendor 1: $6.28/mile Vendor 2: $6.52/mile Difference: +3.8% 5+ miles: Vendor 1: $4.46/mile Vendor 2: $4.50/mile Difference: +0.9%
Customer Experience and Other Factors¶
3.2.13 [5 marks]
Analyse average tip percentages based on trip distances, passenger counts and time of pickup. What factors lead to low tip percentages?
Tip Analysis by Distance Tier:
------------------------------------------------------------
tip_percentage payment_type
mean count <lambda>
distance_tier
0-2 miles 21.31 144388 80.53
2-5 miles 18.06 73839 85.15
5+ miles 15.98 47210 82.02
Tip Analysis by Passenger Count:
------------------------------------------------------------
tip_percentage payment_type
mean count <lambda>
passenger_count
1 19.61 206241 83.01
2 19.13 38676 79.64
3 18.47 9642 77.61
4 16.75 5342 71.53
5 20.36 3302 82.16
6 20.19 2234 82.77
Tip Analysis by Time Period:
------------------------------------------------------------
tip_percentage payment_type
mean count <lambda>
time_period
Night (0-6) 18.46 24788 83.55
Morning Peak (7-10) 18.80 40948 81.68
Midday (11-15) 18.38 75242 79.01
Evening Peak (16-19) 20.66 69849 82.82
Evening (20-23) 20.35 54610 85.00
Low Tip Analysis (Tips < 10%):
------------------------------------------------------------
Number of low tip trips: 72,089
Percentage of all trips: 27.2%
Characteristics of Low Tip Trips:
Average distance: 3.96 miles
Average fare: $21.50
Cash payment percentage: 61.7%
Additional analysis [optional]: Let's try comparing cases of low tips with cases of high tips to find out if we find a clear aspect that drives up the tipping behaviours
Detailed Comparison of Tip Categories:
--------------------------------------------------------------------------------
Avg Distance Avg Fare Avg Passengers % Card Payments \
tip_category
Low Tips (<10%) 3.95 21.44 1.40 34.69
Medium Tips (10-25%) 4.89 25.56 1.36 100.00
High Tips (>25%) 2.34 14.57 1.36 99.99
Number of Trips
tip_category
Low Tips (<10%) 72824
Medium Tips (10-25%) 79347
High Tips (>25%) 113266
Distribution of Tip Categories:
--------------------------------------------------------------------------------
Low Tips (<10%): 72,824 trips (27.4%)
High Tips (>25%): 113,266 trips (42.7%)
Medium Tips (10-25%): 79,347 trips (29.9%)
3.2.14 [3 marks]
Analyse the variation of passenger count across hours and days of the week.
Hourly Passenger Analysis: ------------------------------------------------------------ Hour Avg Passengers Trip Count Peak Hours ------------------------------------------------------------ 00:00 1.39 7507 01:00 1.43 5047 02:00 1.41 3320 03:00 1.40 2196 04:00 1.31 1504 05:00 1.27 1549 06:00 1.22 3665 07:00 1.26 7251 Morning Peak 08:00 1.29 10055 Morning Peak 09:00 1.31 11339 Morning Peak 10:00 1.35 12303 Morning Peak 11:00 1.35 13355 12:00 1.36 14501 13:00 1.36 14944 14:00 1.39 16011 15:00 1.40 16431 16:00 1.39 16454 Evening Peak 17:00 1.37 17906 Evening Peak 18:00 1.36 18747 Evening Peak 19:00 1.38 16742 Evening Peak 20:00 1.39 14929 21:00 1.40 14921 22:00 1.42 13821 23:00 1.42 10939 Daily Passenger Analysis: ------------------------------------------------------------ Day Avg Passengers Trip Count ------------------------------------------------------------ Monday 1.34 33060 Tuesday 1.32 38385 Wednesday 1.32 40663 Thursday 1.33 41610 Friday 1.39 39494 Saturday 1.45 38675 Sunday 1.45 33550 Additional Insights: ------------------------------------------------------------ Weekday vs Weekend Comparison: Weekdays: Average passengers: 1.34 Number of trips: 193,212 Weekends: Average passengers: 1.45 Number of trips: 72,225 Peak vs Off-peak Comparison: Peak Hours (7-10 AM, 4-7 PM): Average passengers: 1.35 Number of trips: 110,797 Off-peak Hours: Average passengers: 1.38 Number of trips: 154,640
3.2.15 [2 marks]
Analyse the variation of passenger counts across zones
Zone Analysis (Top 15 by Trip Volume):
--------------------------------------------------------------------------------
Avg Passengers Std Dev Trip Count \
pickup_zone
JFK Airport 1.50 0.90 13638
Upper East Side South 1.32 0.82 12341
Midtown Center 1.38 0.88 12269
Upper East Side North 1.35 0.88 11083
Midtown East 1.34 0.85 9425
LaGuardia Airport 1.37 0.84 9018
Penn Station/Madison Sq West 1.35 0.84 8871
Times Sq/Theatre District 1.48 0.92 8741
Lincoln Square East 1.36 0.81 8628
Murray Hill 1.34 0.86 7790
Midtown North 1.40 0.88 7580
Upper West Side South 1.36 0.86 7412
Union Sq 1.35 0.83 7130
Clinton East 1.40 0.90 7098
East Chelsea 1.41 0.91 6962
Avg Distance
pickup_zone
JFK Airport 15.93
Upper East Side South 1.82
Midtown Center 2.49
Upper East Side North 1.98
Midtown East 2.43
LaGuardia Airport 9.61
Penn Station/Madison Sq West 2.45
Times Sq/Theatre District 3.28
Lincoln Square East 2.28
Murray Hill 2.42
Midtown North 2.55
Upper West Side South 2.28
Union Sq 2.19
Clinton East 2.63
East Chelsea 2.58
Zone Type Analysis:
--------------------------------------------------------------------------------
Airport Zones:
mean count
pickup_zone
JFK Airport 1.50 13638
LaGuardia Airport 1.37 9018
Manhattan Business Districts:
mean count
pickup_zone
Midtown Center 1.38 12269
Times Sq/Theatre District 1.48 8741
Upper East Side South 1.32 12341
Correlation between trip distance and passenger count: 0.038
Is this correlation OK? Yes, because:
It makes logical sense that trip distance and passenger count would be largely independent
People don't necessarily travel in larger groups for longer distances
Short trips can have many passengers (e.g., group dinner)
Long trips can have single passengers (e.g., airport runs)
Passenger Count Distribution by Distance Tier:
------------------------------------------------------------
mean std count
distance_tier
0-2 miles 1.360397 0.861193 144388
2-5 miles 1.362992 0.857786 73839
5-10 miles 1.350116 0.840418 23775
10+ miles 1.477918 0.901558 23435
Find out how often surcharges/extra charges are applied to understand their prevalance
3.2.16 [5 marks]
Analyse the pickup/dropoff zones or times when extra charges are applied more frequently
Overall Surcharge Statistics:
------------------------------------------------------------
Congestion Surcharge:
Applied to 89.3% of trips
Average amount when applied: $2.23
Improvement Surcharge:
Applied to 100.0% of trips
Average amount when applied: $1.00
Mta Tax:
Applied to 99.1% of trips
Average amount when applied: $0.50
Extra:
Applied to 60.1% of trips
Average amount when applied: $1.59
Top 10 Zones with Highest Extra Charges:
------------------------------------------------------------
has_congestion has_extra
pickup_zone
City Island 100.000000 100.000000
Penn Station/Madison Sq West 98.162552 58.426333
Midtown East 98.100796 61.241379
Upper East Side South 97.941820 56.786322
Midtown Center 97.342897 63.998696
Battery Park 97.321429 37.500000
Midtown North 96.992084 62.058047
Upper East Side North 96.968330 52.142922
Union Sq 96.872370 64.095372
Garment District 96.689977 59.720280
Peak vs Off-peak Analysis:
------------------------------------------------------------
Peak Hours (7-10 AM, 4-7 PM):
congestion_surcharge extra
mean count mean count
hour
False 2.23 154640 1.3 154640
True 2.24 110797 2.0 110797
4 Conclusion¶
[15 marks]
4.1 Final Insights and Recommendations¶
[15 marks]
Conclude your analyses here. Include all the outcomes you found based on the analysis.
Based on the insights, frame a concluding story explaining suitable parameters such as location, time of the day, day of the week etc. to be kept in mind while devising a strategy to meet customer demand and optimise supply.
4.1.1 [5 marks]
Recommendations to optimize routing and dispatching based on demand patterns and operational inefficiencies
Recommendations to Optimize Routing and Dispatching:
Time-Based Route Optimization
- Avoid congested areas during 8-10 AM where average speed drops to 7.72 mph
- Utilize alternative routes during evening peak (4-7 PM) when speeds average 8.42 mph
- Take advantage of faster travel times during off-peak hours (average 8.62 mph at noon)
Distance-Based Efficiency
- Focus on short trips (0-2 miles) during peak hours for maximum efficiency
- Optimize long-distance routes (5+ miles) for airport runs
- Plan routes to minimize empty returns, especially for airport trips
Peak Hour Management
- Highest demand at 6 PM (18,751 trips) requires efficient route planning
- Morning peak (7-10 AM) needs strategic positioning near business districts
- Late night (11 PM-5 AM) requires focus on entertainment districts
Operational Improvements
- Monitor and avoid slowest routes (e.g., Midtown South to Garment District: 4.33 mph)
- Utilize data on inter-borough travel patterns for better route planning
- Implement real-time traffic monitoring for dynamic route adjustments
Zone-Based Routing
- Prioritize high-volume zones (JFK Airport: 13,638 trips)
- Optimize routes between popular zone pairs
- Create efficient coverage patterns for Manhattan (88.36% of all pickups)
4.1.2 [5 marks]
Suggestions on strategically positioning cabs across different zones to make best use of insights uncovered by analysing trip trends across time, days and months.
Strategic Positioning Recommendations:
Time-Based Positioning
- Morning (7-10 AM): Position near residential areas like Upper East Side
- Midday (11 AM-3 PM): Focus on business districts and shopping areas
- Evening (4-7 PM): Concentrate around office areas and transport hubs
- Night (11 PM-5 AM): Position near entertainment districts (12.1% of revenue)
Zone-Based Distribution High-Priority Zones (based on trip volume):
- JFK Airport (13,638 trips)
- Upper East Side South (12,341 trips)
- Midtown Center (12,269 trips)
- Upper East Side North (11,083 trips)
- LaGuardia Airport (9,018 trips)
Weekly Pattern Optimization Weekdays:
- Business districts during rush hours
- Airport coverage during off-peak
Weekends:
- Entertainment districts
- Shopping areas
- Tourist locations
Seasonal Adjustments
- Q2 (26.67% revenue): Increase coverage
- Q3 (22.77% revenue): Adjust for lower demand
- Q4 (27.02% revenue): Maximum deployment
Special Event Considerations
- Monitor and adjust for local events
- Coordinate with airport schedules
- Account for weather conditionsations__
4.1.3 [5 marks]
Propose data-driven adjustments to the pricing strategy to maximize revenue while maintaining competitive rates with other vendors.
Data-Driven Pricing Strategy Adjustments:
- Distance-Based Price Structure
SHORT TRIPS (0-2 miles)
- Rate: $9.49 per mile
- Best for: Local travel, business district trips
MEDIUM TRIPS (2-5 miles)
- Rate: $6.28 per mile
- Best for: Inter-borough, cross-town travel
LONG TRIPS (5+ miles)
- Rate: $4.46 per mile
- Best for: Airport runs, outer borough service
- Time-Based Pricing
PEAK HOURS (4-7 PM)
- Implement surge pricing
- Monitor competitor rates
- Adjust based on demand
BUSINESS HOURS (9 AM-5 PM)
- Standard rate plus congestion charge
- Corporate client rates
- Volume-based discounts
LATE NIGHT SERVICE
- Special rate structure
- Safety premium
- Guaranteed service fee
- Zone-Based Pricing
MANHATTAN BUSINESS DISTRICT
- Premium rates during peak hours
- Corporate account options
- Regular client discounts
AIRPORT SERVICE
- Fixed rate packages
- Terminal-specific pricing
- Flight delay adjustment rates
- Special Condition Adjustments
- Congestion pricing (89.3% trips affected)
- Weather condition surcharges
- Special event rates
- Payment Incentives
- Card payment bonuses
- Loyalty program benefits
- Corporate account rateslty program benefits
- Corporate account rateslty program benefits
- Corporate account rateslty program benefits
- Corporate account rates Loyalty program benefits
- Corporate account rates
Additional Strategic Recommendations:
Technical and Data Implementation
- Predictive analytics for demand forecasting
- Real-time heat maps for demand visualization
- Machine learning for route optimization
- Driver performance tracking systems
- Automated surge pricing mechanisms
Market Segmentation Strategy Business Travelers:
- Premium guaranteed pickup service
- Corporate account management
- Express route options
Tourists:
- Airport-attraction packages
- Multi-stop tour rates
- Simplified fixed pricing
Regular Commuters:
- Loyalty program
- Subscription-based services
- Peak hour priority
Advanced Operational Metrics Performance Tracking:
- Route efficiency scores
- Driver rating system
- Vehicle utilization rates
- Peak hour performance metrics
Quality Control:
- Customer feedback integration
- Service level monitoring
- Response time optimization
- Complaint resolution tracking
Competitive Edge Initiatives Service Differentiation:
- Premium vehicle options
- Pre-scheduled rides
- Multi-language support
- Special event services
Technology Integration:
- Mobile app enhancement
- Digital payment options
- Real-time tracking
- Automated customer support